In recent years, face recognition systems have achieved extremely high levels of performance, opening the door to applications where reliability requirements previously ruled out automation. This is mainly due to the adoption of deep learning techniques in computer vision. The most widely adopted paradigm is to train a feature extractor $f: \mathcal{X} \rightarrow \mathbb{R}^d$ which, from a given image $im \in \mathcal{X}$, extracts a feature vector $z \in \mathbb{R}^d$ that synthesizes the relevant characteristics of $im$.
The recognition phase then consists, from two images $im_1, im_2$, in predicting whether or not they correspond to the same identity, based on the extracted features $z_1, z_2$.
In this data challenge, the goal is to train a machine learning model that, given a vector $[z_1, z_2]$ consisting of the concatenation of two templates $z_1$ and $z_2$, predicts whether or not these two images match the same identity.
The train set consists of two files: train_data.npy and train_labels.txt.
The train_data.npy file contains one observation per line, consisting of the concatenation of two templates, each of dimension 48.
The file train_labels.txt contains one label per line, indicating whether the image pair matches the same identity:
- 1 => image pair belonging to the same identity
- 0 => image pair not belonging to the same identity

For the evaluation of the models, the idea is to minimize the sum of the false positive rate (FPR) and the false negative rate (FNR). The performance score of the model is calculated using the following equation:
$score = 1 - (FPR + FNR)$
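This score can be computed directly from the predictions; a minimal sketch (the helper name `challenge_score` is ours, not part of the challenge):

```python
import numpy as np

def challenge_score(y_true, y_pred):
    """Return 1 - (FPR + FNR) for binary labels in {0, 1}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fpr = np.mean(y_pred[y_true == 0] == 1)  # negatives wrongly predicted positive
    fnr = np.mean(y_pred[y_true == 1] == 0)  # positives wrongly predicted negative
    return 1 - (fpr + fnr)

print(challenge_score([0, 0, 1, 1], [0, 1, 1, 1]))  # FPR=0.5, FNR=0 -> 0.5
```

A perfect classifier scores 1, a random one scores about 0, and a systematically inverted one scores -1.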
### Install requirements ###
#!pip install featurewiz
#!pip install scikit_optimize
#!pip3 install catboost
### Data transformation libs ###
import numpy as np
import pandas as pd
### Viz libs ###
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
import seaborn as sns
from statsmodels.graphics.tsaplots import plot_acf,plot_pacf
### Features selection libs ###
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import ElasticNet
from itertools import product
from featurewiz import featurewiz
### Models selection libs ###
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
from skopt import BayesSearchCV
### Metrics Evaluation libs ###
from sklearn.metrics import accuracy_score,f1_score,roc_auc_score,confusion_matrix,roc_curve
from sklearn.metrics import make_scorer
### ML libs ###
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV,train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import VotingClassifier
### Deep Learning libs ###
import tensorflow as tf
from tensorflow import keras
### options ###
np.random.seed(seed=42)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
def extract_labels(txt_file):
    """
    Extract labels from a text file
    ---- PARAMETERS ----
    Input : text file (.txt)
    Return : labels as a numpy array
    """
    with open(txt_file) as file:
        lines = file.readlines()
    y = []
    for elem in lines:
        label = int(elem[0])  # the label is the first character of each line
        y.append(label)
    y = np.array(y)
    return y
train_data and train_labels
X, y = np.load("train_data.npy"), extract_labels("train_labels.txt")
train_data to dataframe
X_copied = X.copy()
X_dataframe = pd.DataFrame(X_copied)
train_labels to dataframe
y_copied = y.copy()
y_dataframe = pd.DataFrame(y_copied)
train_data
X_dataframe.head(10)
By visualizing the dataframe, we see that most of the features appear quantitative, while some, such as col_8, col_94 or col_95, appear binary.
Moreover, some features seem to lie between 0 and 1.
However, these hypotheses must be verified by a data mining analysis.
columns_list = list()
for i in range(X_dataframe.shape[1]):
    name_col = "col_" + str(i)
    columns_list.append(name_col)
X_dataframe.columns = columns_list
X_dataframe
train_dataframe description
print("#### X_dataframe Description #### \n")
print("DIMENSION")
print("- " + str(X_dataframe.shape[1]) + ' features\n' + "- " + str(X_dataframe.shape[0]) + ' observations\n')
print("COLUMNS TYPES")
print(X_dataframe.dtypes)
print("\nMISSING VALUES")
print('- ' + str(X_dataframe.isna().sum().sum()) + ' missing values\n')
print("BINARY VARIABLES")
for i in range(X_dataframe.shape[1]):
    X_unique = X_dataframe["col_"+str(i)].unique().tolist()
    if len(X_unique) == 2:
        print("- col_" + str(i) + " is a binary variable : ", X_unique)
We notice that all the columns are of type float.
The dataset is composed of 96 features for a total of 297 232 observations.
We notice that the number of features is quite large (close to 100). We can reasonably suppose that not all of these features will be relevant for training a machine learning model. Indeed, since machine learning models have no understanding of causality, they try to map any feature included in the dataset to the target variable, even when there is no causal relationship, which can lead to inaccurate models. Having too many features can also confuse certain algorithms, such as clustering algorithms. It will therefore be useful to apply dimensionality reduction tools, both to reduce the cost of training the model and to solve a complex problem with simpler models.
Moreover, the dataset does not contain any missing values (NaN values).
Finally, we notice that the 20th column (col_20) and the 68th column (col_68) of the dataset are binary variables with values {0.0, 255.0}.
1 or 0
X_dataframe.loc[X_dataframe['col_20'] == 0.0, 'col_20'] = 0
X_dataframe.loc[X_dataframe['col_20'] == 255.0, 'col_20'] = 1
X_dataframe['col_20'] = X_dataframe['col_20'].astype(int)
X_dataframe.loc[X_dataframe['col_68'] == 0.0, 'col_68'] = 0
X_dataframe.loc[X_dataframe['col_68'] == 255.0, 'col_68'] = 1
X_dataframe['col_68'] = X_dataframe['col_68'].astype(int)
X_data = pd.DataFrame(X_dataframe[["col_0", "col_48","col_47", "col_95"]])
X_data
As indicated in the data challenge statement, the dataframe is constituted by the concatenation of two templates, each of dimension 48.
By visualizing the first and last columns of each template, we notice that the orders of magnitude between the two templates seem consistent.
The 1st template is the concatenation of columns col_0 to col_47.
The 2nd template corresponds to the concatenation of columns col_48 to col_95.
The dataframe therefore complies with the data challenge description.
X_template_1 = X_dataframe.iloc[:,0:48]
X_template_2 = X_dataframe.iloc[:,48:96]
template_1
print("--------------- TEMPLATE 1 DESCRIPTION ---------------\n")
round(X_template_1.describe(),2)
template_2
print("--------------- TEMPLATE 2 DESCRIPTION ---------------\n")
round(X_template_2.describe(),2)
Overall, the features appear clearly heterogeneous: the scales of the quantitative features vary from one variable to another, which can be seen in the variations of the standard deviation across features. This trend holds both for the template_1 dataframe and for the template_2 dataframe. For example, in template_1, col_0 has a standard deviation of 50.11, while col_2 and col_7 have standard deviations of 0.04 and 14.85 respectively. The variable col_0 therefore takes much larger values than most other features of the dataset. Nevertheless, for most variables the standard deviation fluctuates around 0.
Furthermore, the descriptions of the two templates present rather clear homogeneity. Comparing side by side, the standard deviations and means are quite close; at the very least, the order of magnitude and the amplitude seem to match.
As an example, the variable col_7 of template_1 and the variable col_55 of template_2 have very close means and standard deviations:
- Mean: 35.22 (template_1) vs 36.56 (template_2) => 1.34 difference
- Standard deviation: 14.85 (template_1) vs 14.66 (template_2) => 0.19 difference
We can also note that some variables have rather high amplitudes, with a standard deviation higher than 100, as is the case for col_15 to col_19 in template_1 and col_63 to col_67 in template_2. This again argues for standardizing the features before using them for machine learning.
To sum up, these features are clearly quantitative, but it is difficult to give them an intuitive interpretation. Given that the scales of the variables are very disparate, it is necessary to rescale them so that they can be compared on a common scale: before applying a machine learning algorithm, the variables should be normalized.
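The rescaling described above can be done with a simple per-column z-score; a minimal sketch on toy data (the values are illustrative, not taken from the challenge dataset):

```python
import numpy as np

# Toy stand-in for heterogeneous features: one large-scale and one small-scale column
X_demo = np.array([[50.0, 0.1],
                   [150.0, 0.3],
                   [250.0, 0.5]])

# Per-column z-score: subtract the mean, divide by the standard deviation
X_scaled = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Both columns now have mean ~0 and unit variance, hence a common scale
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

scikit-learn's `StandardScaler` computes the same transform while remembering the train-set statistics for reuse on the test set.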
y_dataframe.head(10)
At first sight, the labels appear binary: the values taken by the target variable are 1 or 0. The next step is to confirm this by describing the train labels dataset.
columns_list_labels = list()
for i in range(y_dataframe.shape[1]):
    name_col_label = "label_" + str(i)
    columns_list_labels.append(name_col_label)
y_dataframe.columns = columns_list_labels
y_dataframe
print("#### y_dataframe Description #### \n")
print("DIMENSION")
print("- " + str(y_dataframe.shape[1]) + ' features\n' + "- " + str(y_dataframe.shape[0]) + ' observations\n')
print("COLUMNS TYPES")
print(y_dataframe.dtypes)
print("\nMISSING VALUES")
print('- ' + str(y_dataframe.isna().sum().sum()) + ' missing values\n')
print("BINARY VARIABLES")
for i in range(y_dataframe.shape[1]):
    y_unique = y_dataframe["label_"+str(i)].unique().tolist()
    if len(y_unique) == 2:
        print("- label_" + str(i) + " is a binary variable : ", y_unique)
The training labels dataset is composed of a single column corresponding to the target variable (i.e. the variable that the machine learning model will try to predict).
We also notice that the train_labels dataset has the same number of rows as the train data (297232 rows). Had this not been the case, part of the training data could not have been used to train the models.
Moreover, train_labels does not contain any missing values (i.e. NaN values).
Finally, we can now confirm that the target variable is binary, composed of two classes {0, 1}. The values are of type float64, so it is necessary to convert them to integers (int64).
train_labels from float to int
y_dataframe['label_0'] = y_dataframe['label_0'].astype(int)
train_labels distribution analysis
bar_plot = sns.histplot(data=y_dataframe, x="label_0").set_title('Labels Distribution', fontsize = 15)
counter_values = y_dataframe['label_0'].value_counts()
print(counter_values)
Concerning the distribution of the train labels, we notice that it is perfectly balanced.
This is a good point: we will not need resampling techniques such as SMOTE to obtain a balanced dataset.
The train sample contains 90% of the train dataset; the validation sample contains the remaining 10%.
X_train, X_valid, y_train, y_valid = train_test_split(X_dataframe, y_dataframe, test_size=0.1)
correlations = X_train.corr()
fig_1 = plt.figure(figsize=(30, 20))
sns.heatmap(correlations, xticklabels=correlations.columns, yticklabels=correlations.columns,
cmap='YlGnBu')
plt.title('Heatmap for features correlation\n', fontsize=30)
plt.show()
This heatmap allows us to visualize the correlations between the different features of the dataset. If two features are strongly correlated with each other, they will have a similar effect on the target variable. It is therefore unnecessary to include both strongly correlated features during the training phase of a machine learning model: one of them can be removed without negatively impacting the accuracy of the model's predictions.
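The removal of one feature per strongly correlated pair can be sketched on a small synthetic example (the column names and the 0.95 threshold here are ours, chosen for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "col_a": a,
    "col_b": 2 * a + rng.normal(scale=0.01, size=200),  # near-duplicate of col_a
    "col_c": rng.normal(size=200),                       # independent feature
})

corr = df.corr().abs()
# Keep only the upper triangle, so each pair is inspected exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Drop one column of every pair whose |correlation| exceeds the threshold
to_drop = [c for c in upper.columns if (upper[c] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop)  # ['col_b']
```

Libraries like featurewiz automate this pruning and use a smarter criterion to decide which member of a pair to discard.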
Given the number of features in our dataset, it is difficult to visually identify each pair of correlated variables. However, two striking phenomena clearly emerge from the heatmap.
First, some features are positively correlated while others are negatively correlated.
Second, the features within each template seem to be more positively correlated with each other; on the other hand, when we cross the 48 features of template_1 with the 48 features of template_2, they tend to be more negatively correlated.
It is now necessary to verify these hypotheses by visualizing more closely the correlation.
template_1
correlations_temp1 = X_train.iloc[:,0:48].corr()
fig_1 = plt.figure(figsize=(30, 20))
sns.heatmap(correlations_temp1, xticklabels=correlations_temp1.columns, yticklabels=correlations_temp1.columns,
cmap='YlGnBu', annot=True, linewidths=.5)
plt.title('Heatmap for Template 1 features correlation\n', fontsize=30)
plt.show()
This heatmap lets us analyze the feature correlations of template_1 at a finer granularity.
By focusing on the 48 features of template_1, we can read the correlation value between each pair of features (outside the diagonal).
This observation seems to confirm the first hypothesis stated previously: the features are mostly positively correlated with each other. For example, the pair {col_17, col_20}, with a correlation coefficient of 0.76, is strongly positively correlated.
We also observe that some features are strongly negatively correlated, although this occurs less frequently. For example, the pairs {col_2, col_1} and {col_3, col_1}, with correlation coefficients of -0.75 and -0.68 respectively, are strongly negatively correlated.
template_2
correlations_temp2 = X_train.iloc[:,48:].corr()
fig_2 = plt.figure(figsize=(30, 20))
sns.heatmap(correlations_temp2, xticklabels=correlations_temp2.columns, yticklabels=correlations_temp2.columns,
cmap='YlGnBu', annot=True, linewidths=.5)
plt.title('Heatmap for Template 2 features correlation\n', fontsize=30)
plt.show()
As with the heatmap of template_1, this one lets us analyze the correlations of the template_2 variables at a finer granularity.
Similarly, by focusing on the 48 features of template_2, we can read the correlation value between each pair of features (outside the diagonal).
For template_2, however, it is harder to validate the first hypothesis: the difference in proportion between positively and negatively correlated features is less obvious.
Some features are strongly positively correlated, such as {col_67, col_65} with a correlation coefficient of 0.85; on the other hand, features such as {col_68, col_64}, with a correlation coefficient of -0.85, are strongly negatively correlated.
In any case, the distribution of correlated features does not seem to follow a clear logic, and at this stage it is not possible to provide a reliable interpretation of these correlations.
The objective of PCA is to simplify the model while retaining as much information as possible. To do so, PCA finds the best axes onto which to project the data in a lower dimension by maximizing the variance; the axis maximizing the variance is also the one closest to all the data points.
nb = 50000 # analyse the first 50 000 observations
X_pca = X_train[:nb].to_numpy()
y_pca = y_train[:nb].to_numpy()
mask = (y_pca==1).flatten()
pca = PCA(n_components=2)
pca_components = pca.fit_transform(X_pca)
label1 = pca_components[mask]
label2 = pca_components[~mask]
fig_1 = plt.figure(figsize=(20,16))
plt.scatter(label1[:,0], label1[:,1], c="lightsalmon", alpha=0.5, s=2, label="y = 1")
plt.scatter(label2[:,0], label2[:,1], c="lightskyblue", alpha=0.5, s=2, label="y = 0")
plt.xlim([-400, 400])
plt.xlabel("component 1", fontsize = 15)
plt.ylim([-230, 350])
plt.ylabel("component 2", fontsize = 15)
plt.title('PCA on the model features\n', fontsize = 30)
plt.grid(True)
plt.legend(fontsize = 15)
plt.show()
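Before interpreting a 2-D scatter plot like the one above, it is worth checking how much of the total variance the two retained components actually capture; a sketch on synthetic data (not the challenge features):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data whose variance lives mostly in a 2-D subspace of R^10
rng = np.random.default_rng(42)
latent = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10)) * 10.0
X_demo = latent + rng.normal(scale=0.1, size=(500, 10))

pca = PCA(n_components=2).fit(X_demo)
# Fraction of the total variance retained by the 2 components
print(pca.explained_variance_ratio_.sum())  # close to 1.0 here
```

If `explained_variance_ratio_.sum()` is low on real data, the apparent overlap of the two classes in the plot may simply reflect information lost by the projection.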
At first glance, the data does not seem separable at all.
We expected two distinct, compact groups, one per label value; instead, both the y=0 and the y=1 classes are very large and spread out, and the separation between the classes is not clearly visible.
Hence, the PCA suggests that the hardest part of this data challenge will be correctly identifying matching image pairs, which will tend to be misclassified as non-matching pairs.
Throughout this data exploration phase, we can see that the features of each template follow a very similar structure. This suggests that columns col_0 to col_47 (template_1) and col_48 to col_95 (template_2) each uniquely represent an image, and therefore that the features describing template_1 and those describing template_2 should logically not share the same values.
Although this is a rather strong assumption, it must be checked by searching for duplicates in the dataset.
Indeed, if it proves correct and the same image pairs appear in both the training set and the test set, the information from the training set could be exploited directly to produce predictions on the test set.
# drop duplicated elements in the train set
# if at least two rows contain the same values, one of these rows is rejected from the train set
nb_template1_train = X_train.iloc[:,0:48].drop_duplicates()
nb_template2_train = X_train.iloc[:,48:].drop_duplicates()
print("---- DISTINCTS VALUES TRAIN SET ----")
print("Train set size :", X_train.shape[0])
print("- " + str(nb_template1_train.shape[0]) + " distinct value(s) for template 1 in the train set (i.e " + str(round((nb_template1_train.shape[0] / X_train.shape[0])*100, 3)) + "% of train set size)")
print("- " + str(nb_template2_train.shape[0]) + " distinct value(s) for template 2 in the train set (i.e " + str(round((nb_template2_train.shape[0] / X_train.shape[0])*100, 3)) + "% of train set size)\n")
# drop duplicated elements in the validation set
# if at least two rows contain the same values, one of these rows is rejected from the validation set
nb_template1_valid = X_valid.iloc[:,0:48].drop_duplicates()
nb_template2_valid = X_valid.iloc[:,48:].drop_duplicates()
print("---- DISTINCTS VALUES VALIDATION SET ----")
print("Validation set size :", X_valid.shape[0])
print("- " + str(nb_template1_valid.shape[0]) + " distinct values for template 1 in the validation set (i.e " + str(round((nb_template1_valid.shape[0] / X_valid.shape[0])*100, 3)) + "% of validation set size)")
print("- " + str(nb_template2_valid.shape[0]) + " distinct values for template 2 in the validation set (i.e " + str(round((nb_template2_valid.shape[0] / X_valid.shape[0])*100, 3)) + "% of validation set size)\n")
# check whether some template-1 vectors of the train set also appear in the validation set
list_template1_train = np.asarray(X_train.iloc[:,0:48].drop_duplicates().values.tolist())
list_template1_valid = np.asarray(X_valid.iloc[:,0:48].drop_duplicates().values.tolist())
template1_train_set = set([tuple(x) for x in list_template1_train])
template1_valid_set = set([tuple(x) for x in list_template1_valid])
intersection_template1 = np.array([x for x in template1_train_set & template1_valid_set])
nb_intersection = intersection_template1.shape[0]
print("\n---- SAME TEMPLATE 1 BETWEEN TRAIN SET AND VALIDATION SET ----")
print("- " + str(nb_intersection) + " template-1 value(s) in the valid set that are also part of the train set.")
# check whether some template-2 vectors of the train set also appear in the validation set
list_template2_train = np.asarray(X_train.iloc[:,48:].drop_duplicates().values.tolist())
list_template2_valid = np.asarray(X_valid.iloc[:,48:].drop_duplicates().values.tolist())
template2_train_set = set([tuple(x) for x in list_template2_train])
template2_valid_set = set([tuple(x) for x in list_template2_valid])
intersection_template2 = np.array([x for x in template2_train_set & template2_valid_set])
nb_intersection = intersection_template2.shape[0]
print("\n---- SAME TEMPLATE 2 BETWEEN TRAIN SET AND VALIDATION SET ----")
print("- " + str(nb_intersection) + " template-2 value(s) in the valid set that are also part of the train set.")
Concerning the training dataset, almost all feature values of template_1 and template_2 are distinct: 99.98% of the training rows describing template_1 are distinct, as are 99.99% of those describing template_2. The dataset therefore contains almost no duplicates, meaning the images behind template_1 and template_2 are largely different from each other.
The next question is whether the template_1 and template_2 images of the training dataset also appear in the validation dataset; if so, this information could be exploited profitably. Here, only 4 template_1 vectors and 1 template_2 vector of the validation dataset are also contained in the training dataset. The strategy of identifying and exploiting images common to the training and test sets therefore does not seem profitable or efficient, and it will not be pursued.
Following this exploratory phase, it is appropriate to centralize the various pre-processing operations carried out above within a single function.
def preprocessing():
    """
    Preprocessed dataframe
    ---- PARAMETERS ----
    Input : None
    Return : train dataframe cleaned, train labels cleaned
    """
    X, y = np.load("train_data.npy"), extract_labels("train_labels.txt")
    # convert X to DataFrame
    X_copied = X.copy()
    X_dataframe = pd.DataFrame(X_copied)
    # convert y to DataFrame
    y_copied = y.copy()
    y_dataframe = pd.DataFrame(y_copied)
    # rename X_dataframe columns
    columns_list = list()
    for i in range(X_dataframe.shape[1]):
        name_col = "col_" + str(i)
        columns_list.append(name_col)
    X_dataframe.columns = columns_list
    # rename y_dataframe columns
    columns_list_labels = list()
    for i in range(y_dataframe.shape[1]):
        name_col_label = "label_" + str(i)
        columns_list_labels.append(name_col_label)
    y_dataframe.columns = columns_list_labels
    # preprocessing X_dataframe: map the binary {0.0, 255.0} columns to {0, 1}
    X_dataframe.loc[X_dataframe['col_20'] == 0.0, 'col_20'] = 0
    X_dataframe.loc[X_dataframe['col_20'] == 255.0, 'col_20'] = 1
    X_dataframe['col_20'] = X_dataframe['col_20'].astype(int)
    X_dataframe.loc[X_dataframe['col_68'] == 0.0, 'col_68'] = 0
    X_dataframe.loc[X_dataframe['col_68'] == 255.0, 'col_68'] = 1
    X_dataframe['col_68'] = X_dataframe['col_68'].astype(int)
    # convert labels value from float to int
    y_dataframe['label_0'] = y_dataframe['label_0'].astype(int)
    # select the indexes of duplicated rows for template 1
    X_template_1 = X_dataframe.iloc[:, 0:48]
    template1_dup_rows = X_template_1[X_template_1.duplicated(keep="first")]
    template1_dup_rows_tuple_idx = template1_dup_rows.groupby(list(template1_dup_rows)).apply(lambda x: tuple(x.index)).tolist()
    template1_dup_row_idx = [item for t in template1_dup_rows_tuple_idx for item in t]
    # select the indexes of duplicated rows for template 2
    X_template_2 = X_dataframe.iloc[:, 48:96]
    template2_dup_rows = X_template_2[X_template_2.duplicated(keep="first")]
    template2_dup_rows_tuple_idx = template2_dup_rows.groupby(list(template2_dup_rows)).apply(lambda x: tuple(x.index)).tolist()
    template2_dup_row_idx = [item for t in template2_dup_rows_tuple_idx for item in t]
    dup_idx_list = template1_dup_row_idx + template2_dup_row_idx
    # drop duplicated rows from the train dataset
    X_dataframe.drop(dup_idx_list, axis=0, inplace=True)
    # drop the same rows from the train labels dataset
    y_dataframe.drop(dup_idx_list, axis=0, inplace=True)
    return X_dataframe, y_dataframe
X_df_cleaned, y_df_cleaned = preprocessing()
# drop duplicated elements in the train set
# if at least two rows contain the same values, one of these rows is rejected from the train set
nb_template1_train = X_df_cleaned.iloc[:,0:48].drop_duplicates()
nb_template2_train = X_df_cleaned.iloc[:,48:].drop_duplicates()
print("---- Checking DISTINCTS VALUES X_DataFrame after preprocessing ----")
print("Train set size :", X_df_cleaned.shape[0])
print("- " + str(nb_template1_train.shape[0]) + " distinct value(s) for template 1 in the X_DataFrame (i.e " + str(round((nb_template1_train.shape[0] / X_df_cleaned.shape[0])*100, 3)) + "% of train set size)")
print("- " + str(nb_template2_train.shape[0]) + " distinct value(s) for template 2 in the X_DataFrame (i.e " + str(round((nb_template2_train.shape[0] / X_df_cleaned.shape[0])*100, 3)) + "% of train set size)\n")
The preprocessing phase seems to have done the job: all duplicate images for template_1 and template_2 have been removed, leaving a dataset cleaned of all duplicate elements.
bar_plot = sns.histplot(data=y_df_cleaned, x="label_0").set_title('Labels Distribution after Preprocessing', fontsize=15)
counter_values_cleaned = y_df_cleaned['label_0'].value_counts()
print(counter_values_cleaned)
After the preprocessing phase, we notice that the distribution of the label values is no longer perfectly balanced. The label y=1 is slightly more represented than the label y=0, but this imbalance remains anecdotal compared to the size of the dataset.
train set length and train_labels set length
print("Train dataset length: ", len(X_df_cleaned))
print("Train labels dataset length: ", len(y_df_cleaned))
We can see that the preprocessing has been applied consistently: the train dataset and the train labels dataset have the same length. That's a good point.
We can now elaborate the strategy we will follow for the implementation and execution of the machine learning algorithms.
We notice that the dataset has many features (close to 100). We also noticed earlier, in the correlation analysis, that some variables are particularly strongly correlated. It is therefore advisable to remove one variable of each strongly correlated pair in order to simplify the model and reduce the training cost. However, this may not be enough: once the redundant variables are discarded, it may be necessary to further restrict the number of features in order to work only with a subset of relevant variables. This is the challenge of the feature selection step: minimizing the loss of information caused by discarding variables while simplifying the classification task the machine learning model will have to perform.
The goal of feature selection in machine learning is to find the best set of features to build useful models of the studied phenomena.
To do that, the strategy is to apply 3 feature selection algorithms:
- Featurewiz
- ElasticNet
- Random Forest Feature Importance

Based on the features chosen by these 3 algorithms, I select the 25 best features for each algorithm. The features selected by at least 2 different algorithms will then be used to train the machine learning models.
The idea is to compare the results of each algorithm in order to select the 25 most important features of the dataset in the most robust way.
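The "selected by at least 2 algorithms" vote can be sketched as follows (the feature lists below are hypothetical placeholders, not actual selector outputs):

```python
from collections import Counter

# Hypothetical top-feature lists returned by the three selectors
featurewiz_feats = ["col_3", "col_7", "col_20", "col_55"]
elasticnet_feats = ["col_3", "col_20", "col_41", "col_90"]
rf_feats = ["col_7", "col_20", "col_41", "col_12"]

# Count in how many lists each feature appears
votes = Counter(featurewiz_feats + elasticnet_feats + rf_feats)
# Keep features selected by at least 2 of the 3 algorithms
selected = sorted(f for f, n in votes.items() if n >= 2)
print(selected)  # ['col_20', 'col_3', 'col_41', 'col_7']
```

This majority vote makes the final subset less sensitive to the idiosyncrasies of any single selection algorithm.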
X_df_cleaned
list_binary_columns = list()
for i in X_df_cleaned.columns:
    if len(X_df_cleaned[i].unique()) < 3:
        list_binary_columns.append(i)
list_binary_columns
X_df_copied = X_df_cleaned.copy()
X_raw_df_cleaned = pd.DataFrame(X_df_copied)
for col in X_df_cleaned.columns:
    if col not in list_binary_columns:
        X_df_cleaned[col] = (X_df_cleaned[col] - X_df_cleaned[col].mean()) / X_df_cleaned[col].std()
X_df_cleaned
correlations = X_df_cleaned.corr()
fig_1 = plt.figure(figsize=(30, 20))
sns.heatmap(correlations, xticklabels=correlations.columns, yticklabels=correlations.columns,
cmap='YlGnBu')
plt.title('Heatmap for features correlation before feature selection\n', fontsize=30)
plt.show()
df_corr = X_df_cleaned.corr()
l = list()
for i in X_df_cleaned.columns:
    for j in X_df_cleaned.columns:
        if abs(df_corr.loc[i, j]) > 0.6 and df_corr.loc[i, j] != 1.0:
            if [i, j, df_corr.loc[i, j]] not in l and [j, i, df_corr.loc[j, i]] not in l:
                l.append([i, j, df_corr.loc[i, j]])
df_val = pd.DataFrame(l, columns=['feature1','feature2','val'])
# selecting negative correlated features
df_negative_corr = df_val[df_val['val'] < 0]
# selecting positive correlated features
df_positive_corr = df_val[df_val['val'] >= 0]
print("---- Negative Correlated Features (Less than -0.6)----\n")
print(df_negative_corr)
print("\n\n\n---- Positive Correlated Features (Higher than +0.6) ----\n")
print(df_positive_corr)
df_val = df_val.sort_values(by=['val'], ascending = False)
print("\n\n\n---- Summary correlated features ----")
print("- Negative correlated features : ", df_negative_corr.shape[0])
print("- Positive correlated features : ", df_positive_corr.shape[0])
Among the 96 features that compose the dataset, we find 17 strongly negatively correlated feature pairs and 46 strongly positively correlated feature pairs (|correlation| > 0.6).
Featurewiz
Featurewiz feature selection technique
Featurewiz is a Python library that finds the best features in a dataset, given a dataframe and the name of the target variable. It does the following:
- It automatically removes highly correlated features (the limit is set to 0.5 but can be changed via an input argument). If several features are correlated with each other, which one should be deleted? In such a conflict, the algorithm deletes the feature with the lowest mutual information score.
- Finally, the algorithm performs a recursive feature selection using XGBoost to find the best features.
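The mutual-information criterion can be illustrated with scikit-learn's `mutual_info_classif` on synthetic data (this sketches the idea, not featurewiz's internal implementation):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)

informative = y + rng.normal(scale=0.5, size=1000)  # carries label signal
noise = rng.normal(size=1000)                       # carries no label signal
X_demo = np.column_stack([informative, noise])

mi = mutual_info_classif(X_demo, y, random_state=0)
# The feature carrying label signal gets the higher mutual information score
print(mi.argmax())  # 0
```

Unlike linear correlation, mutual information also captures non-linear dependence between a feature and the target, which makes it a reasonable tie-break criterion.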
# Join train data and train labels in the same dataframe
data = pd.concat([X_df_cleaned, y_df_cleaned], axis=1).reindex(X_df_cleaned.index)
data
# specify the target variables for featurewiz
target = 'label_0'
# Apply featurewiz method for features selection
features, train = featurewiz(data, target, corr_limit=0.5, verbose=2, sep=",",
header=0,test_data="", feature_engg="", category_encoders="")
selected_features_featurewiz = features
df_selected_features_featurewiz = pd.DataFrame(selected_features_featurewiz, columns=['Featurewiz Selected Features'])
df_selected_features_featurewiz
Featurewiz has selected 25 variables. The next step is to verify the job done by featurewiz by analyzing the new correlation matrix.
X_df_cleaned_features_selected_featurewiz = X_df_cleaned[selected_features_featurewiz]
correlations = X_df_cleaned_features_selected_featurewiz.corr()
fig_1 = plt.figure(figsize=(30, 20))
sns.heatmap(correlations, xticklabels=correlations.columns, yticklabels=correlations.columns,
cmap='YlGnBu', annot=True, linewidths=.5)
plt.title('Heatmap for features correlation after feature selection with featurewiz\n', fontsize=30)
plt.show()
We notice that all variable pairs with a correlation coefficient beyond +0.5 or -0.5 have been removed. In conclusion, featurewiz seems to have done its job well, since we obtain a list of features that are only weakly correlated with one another.
PCA after featurewiz

nb = 50000 # analyze the first 50 000 observations
X_raw_df_cleaned_features_selected_featurewiz = X_raw_df_cleaned[selected_features_featurewiz]
X_pca = X_raw_df_cleaned_features_selected_featurewiz[:nb].to_numpy()
y_pca = y_df_cleaned[:nb].to_numpy()
mask = (y_pca==1).flatten()
pca = PCA(n_components=2)
pca_components = pca.fit_transform(X_pca)
label1 = pca_components[mask]
label2 = pca_components[~mask]
fig_1 = plt.figure(figsize=(20,16))
plt.scatter(label1[:,0], label1[:,1], c="lightsalmon", alpha=0.5, s=2, label="y = 1")
plt.scatter(label2[:,0], label2[:,1], c="lightskyblue", alpha=0.5, s=2, label="y = 0")
plt.xlim([-250, 300])
plt.xlabel("component 1", fontsize = 15)
plt.ylim([-250, 300])
plt.ylabel("component 2", fontsize = 15)
plt.title('PCA on the model features\n', fontsize = 30)
plt.grid(True)
plt.legend(fontsize = 15)
plt.show()
After the featurewiz processing, the PCA shows two poles forming three fairly distinct clusters, although these clusters remain somewhat spread out. It is now relatively easier to find a separation between the data of the two classes.
ElasticNet

ElasticNet combines the two types of regularization: it contains both $L1$ and $L2$ penalty terms. I chose this feature selection technique because it works better than Ridge and Lasso regression in most test cases.
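For reference, with $\rho$ denoting l1_ratio and $n$ the number of samples, scikit-learn's ElasticNet minimizes the following objective:

$\underset{w}{\operatorname{min}} \quad \frac{1}{2n}\|y - Xw\|_{2}^{2} + \alpha \rho \|w\|_{1} + \frac{\alpha (1-\rho)}{2}\|w\|_{2}^{2}$

With $\rho = 1$ this reduces to the Lasso and with $\rho = 0$ to Ridge, which is why tuning alpha and l1_ratio jointly explores the whole family of models.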
%%time
def rmse_cv(model):
    rmse = np.sqrt(-cross_val_score(model, X_df_cleaned.iloc[:10000,:], y_df_cleaned.iloc[:10000,:], scoring="neg_mean_squared_error", cv = 5))
    return rmse
# Define parameters of elastic net
alphas = [0.0005, 0.001, 0.01, 0.03, 0.05, 0.1]
l1_ratios = [0.9, 0.8, 0.7, 0.5, 0.3, 0.2, 0.1]
# Run ElasticNet on several parameters
cv_elastic = [rmse_cv(ElasticNet(alpha = alpha, l1_ratio=l1_ratio)).mean()
for (alpha, l1_ratio) in product(alphas, l1_ratios)]
plt.rcParams['figure.figsize'] = (12.0, 6.0)
idx = list(product(alphas, l1_ratios))
p_cv_elastic = pd.Series(cv_elastic, index = idx)
p_cv_elastic.plot(title = "Validation RMSE as a function of the ElasticNet parameters")
plt.xlabel("alpha - l1_ratio")
plt.ylabel("rmse")
# Zoom in to the first 10 parameter pairs
plt.rcParams['figure.figsize'] = (12.0, 6.0)
idx = list(product(alphas, l1_ratios))[:10]
p_cv_elastic = pd.Series(cv_elastic[:10], index = idx)
p_cv_elastic.plot(title = "Validation RMSE as a function of the ElasticNet parameters")
plt.xlabel("alpha - l1_ratio")
plt.ylabel("rmse")
RMSE score as a function of the ElasticNet parameters

pd.DataFrame(p_cv_elastic, columns = ["rmse"])

pd.DataFrame(p_cv_elastic).idxmin()

ElasticNet with the best parameters

elastic = ElasticNet(alpha=0.001, l1_ratio=0.9)
elastic.fit(X_df_cleaned, y_df_cleaned)
coef = pd.Series(elastic.coef_, index = X_df_cleaned.columns)
print("Elastic Net picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
imp_coef = pd.concat([coef.sort_values().head(10), coef.sort_values().tail(10)])
plt.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Coefficients in the Elastic Net Model")
Looking at this graph, we immediately notice that the features col_0 and col_48 appear to have the most impact on the prediction of the target variable. In other words, these features seem to be the most revealing in our dataset, and their degree of importance far exceeds that of all the other features. For the remaining features, the differences in importance are much less pronounced.
In the remainder of this report, we will analyze the 25 features considered most important by the model.
# We select the first 25 most important features
selected_features_elasticNet = pd.DataFrame(coef.sort_values(ascending=False), columns = ['selected_features'])
selected_features_elasticNet = selected_features_elasticNet[:25]
selected_features_elasticNet.dropna(subset = ['selected_features'], inplace=True)
selected_features_elasticNet = selected_features_elasticNet.index.values
pd.DataFrame(selected_features_elasticNet.tolist(), columns = ["ElasticNet Features Selected"])
Random Forest Importance

Random Forest importance is a variable selection technique that uses the Random Forest model to estimate the importance of each feature for predicting the target variable. The feature importance (variable importance) describes which features are relevant. The tree-based strategies used by random forests naturally rank features by how well they improve the purity of the nodes, or in other words by the decrease in impurity (Gini impurity) over all trees. Nodes with the greatest decrease in impurity occur at the start of the trees, while nodes with the least decrease in impurity occur at the end. Thus, by pruning trees below a particular node, we can create a subset of the most important features.
%%time
X_train, X_test, y_train, y_test = train_test_split(X_df_cleaned.iloc[:50000,:], y_df_cleaned.iloc[:50000,:], test_size=0.25, random_state=12)
rf = RandomForestClassifier(n_estimators=500, random_state=12)
rf.fit(X_train, y_train)
importances = rf.feature_importances_
features_importances_df = pd.DataFrame({"Features": pd.DataFrame(X_df_cleaned).columns, "Importances": importances})
features_importances_df = features_importances_df.set_index("Features")
features_importances_df = features_importances_df.sort_values("Importances")
features_importances_df.plot.barh(figsize=(10, 20))
plt.xlabel("Features Importance", fontsize = 15)
plt.ylabel("Features", fontsize = 15)
plt.title("Random Forest Feature Importance", fontsize = 20)
The same observation that we made for the ElasticNet result can be made here: the features col_0 and col_48 are again the most important variables. Overall, the first 5 features selected by Random Forest Feature Importance are identical to the first 5 features selected by the ElasticNet model. The two models corroborate each other.
features_importances_df = features_importances_df.sort_values(ascending=False, by = "Importances")
selected_Features_RF = features_importances_df.iloc[:25,0].tolist()
pd.DataFrame(selected_Features_RF, columns = ["RF Selected Features"])
Let's analyze the 25 most important features selected by the 3 variable selection methods we implemented, namely :
- Featurewiz
- ElasticNet
- Random Forest Features Importance

df_featurewiz = pd.DataFrame(selected_features_featurewiz, columns = ["Featurewiz Features"])
df_elasticNet = pd.DataFrame(selected_features_elasticNet, columns = ["ElasticNet Features"])
df_rf = pd.DataFrame(selected_Features_RF, columns = ["RF Features"])
df_summary_features_selected = pd.concat([df_featurewiz, df_elasticNet, df_rf], axis=1)
df_summary_features_selected
We notice that, for the 3 methods, the features col_0 and col_48 occupy the top of the feature-importance ranking. The features col_78 and col_41 come just behind them. Overall, the 3 models select broadly the same top 25 most important features.
# Creation of a dataframe with the occurences of the features selected by the three models
features_list = list()
for i in range(0, len(selected_features_featurewiz)):
features_list.append(selected_features_featurewiz[i])
features_list.append(selected_features_elasticNet[i])
features_list.append(selected_Features_RF[i])
df_features_list = pd.DataFrame(features_list, columns = ["Features Selected"])
df_features_occurences = pd.DataFrame(df_features_list.value_counts(), columns = ["Count"])
df_features_occurences = df_features_occurences.reset_index()
df_features_occurences
best_selected_features = df_features_occurences.loc[(df_features_occurences["Count"] >= 2)]
best_selected_features = best_selected_features["Features Selected"].tolist()
print("---- Best selected features ----\n")
print(best_selected_features)
Through this data challenge, we have to propose a model that can solve a binary classification task. There is a wide range of algorithms that can handle binary classification; the idea here is not to try them all, as this could prove to be a waste of time with little payoff. Instead, it is advisable to select upstream the algorithms most likely to solve the classification problem at hand. The strategy consists first of testing classical approaches such as k-NN, SVM and Decision Tree. Then, ensemble classification approaches will be applied, as these methods should outperform the classical ones. Finally, the performance of a neural network will be tested, as it could potentially outperform even the ensemble methods.
Below is a map of the algorithms that will be applied to try to solve the classification problem.
For each of the classification algorithms, the best possible hyperparameters must be chosen. Indeed, the choice of hyperparameters is crucial to build the most efficient classification model possible. The Python method GridSearchCV(), given upstream the list of values of each hyperparameter, tests all the possible hyperparameter combinations and retains the parameters of the best model, that is, those that maximize the evaluation criterion, here $1-(FPR+FNR)$. Once these optimal hyperparameter values are known, the model is trained with them; this refinement is the key to obtaining the most efficient model for the binary classification task. However, the search for the best hyperparameters can be very time-consuming, as the execution time of GridSearchCV depends heavily on the number of combinations to be tested. Therefore, the hyperparameter search will be done on a small subsample of the training dataset. Since this subsample is small, cross-validation will also be used to estimate the best parameters of the model; this technique avoids overfitting and evaluates the performance of the model more robustly than a single test run.
Finally, the model with the best performance on the test set will be trained on the whole dataset at our disposal and then used to produce the predictions on the X_test dataset.
Some comments on the choice of algorithms:
- k-NN: Without a doubt, the k-NN algorithm is not the best choice for our case study: we have a rather large dataset, yet k-NN does not scale well because it stores the entire dataset in memory to make a prediction. Therefore k-NN is not really suitable for our classification task. However, for pedagogical purposes, it can be useful to run a k-NN on a small subsample to obtain an order of magnitude of the performance of a relatively simple model.
- Decision Tree: Even if the Decision Tree belongs to the weak models, implementing one upstream can be useful to determine its optimal hyperparameters and then reuse them in an Adaboost model based on a Decision Tree. The advantage of the Decision Tree is that it performs well on large datasets.
- SVM: Support Vector Machines were, 10 years ago, a state-of-the-art model for classification tasks. For this reason, SVM is a serious candidate, even if more advanced and more powerful techniques exist nowadays. However, SVM suffers from the same scaling problem as k-NN. Therefore, as for k-NN, I will train the SVM on a reduced subsample of the training dataset, otherwise the training time would be far too long.
- Random Forests, Adaboost, XGBoost and Gradient Boosting are known to be particularly efficient models for classification tasks. Each of these approaches should be tested individually, taking care to refine the hyperparameters of each algorithm.
- Neural Networks can be very powerful tools and can even outperform ensemble approaches on classification tasks. However, they need to be trained on large datasets and require a very large number of computations. On the one hand, the dataset may not be large enough to obtain convincing results; on the other hand, depending on the configuration of the network, the training time may be long.
A final point to take into account concerns the limited computational performance of my personal computer: the Neural Network architecture will therefore be quite simple (a limited stack of hidden layers), possibly too simple to beat the results obtained with the ensemble models. However, it would be a shame not to test the performance of a neural network on our classification task.

Before running the training of a machine learning model, it is necessary to split the training set into 3 sub-samples:
- X_train, y_train: to train the model
- X_valid, y_valid: to discriminate among models trained on the train set (typically for the purpose of hyperparameter optimisation)
- X_test, y_test: to test the global performance of the model

For this data challenge, we have the training dataset, the training labels and the test set. To form the validation sample, a subsample of data will be created from the training dataset.
Since the original training set is quite large, running cross-validation methods on it will often not be possible on a standard laptop like mine. For this reason, my approach will be flexible. Depending on the model and the optimization procedures used to fine tune the hyperparameters, I will be able to work on smaller subsamples.
For the evaluation of the performance of the models, the idea is to minimize the sum of the false positive rate FPR and the false negative rate FNR. The performance score of the model is calculated using the following equation.
$score = 1 - (FPR + FNR)$
This score metric represents the ability of the model to correctly predict the data.
When training a machine learning model, the sum of FPR and FNR should be minimized. In other words, the training must maximize the score metric.
The predictions of the algorithm must be submitted to the Data Challenge educational site so that the performance score of the model is known.
As shown in the data exploration part, the data is very heterogeneous and differently scaled. It may thus make sense to normalize it before using it for machine learning purposes. However, my preliminary attempts at normalisation produced no improvement in performance. For this reason, I will typically not standardise the data, except for methods where this preprocessing is justified, such as k-NN and SVM, because these methods use distance metrics during training. Indeed, for decision-tree-based models, normalizing the data has no impact on the accuracy of the predictions generated by the algorithm.
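A convenient way to apply this standardisation only where it is needed is to wrap the scaler and the estimator in an sklearn pipeline, so the scaler is re-fitted on each training fold and no validation information leaks in. A minimal sketch on made-up data (the variables here are illustrative, not the challenge data):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
# feature 0 carries the signal, feature 1 is large-scale noise
X = np.hstack([rng.normal(0, 1, (200, 1)), rng.normal(0, 100, (200, 1))])
y = (X[:, 0] > 0).astype(int)

# the scaler is fitted as part of the model, so cross-validation stays leak-free
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X, y)
print(model.score(X, y))
```

Without the scaler, the Euclidean distances would be dominated by the noisy second feature and the k-NN accuracy would collapse toward chance.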
GridSearchCV

def criterion_GridCV(y_true, y_pred):
    CM = confusion_matrix(y_true, y_pred)
    TN, TP = CM[0, 0], CM[1, 1]
    FP, FN = CM[0, 1], CM[1, 0]
    return 1 - (FP/(FP + TN) + FN/(FN + TP))
# specify the metric to maximize to GridSearchCV function
scoring_critetion = make_scorer(criterion_GridCV)
def criterion(y_pred, y_true):
    CM = confusion_matrix(y_true, y_pred)
    TN, TP = CM[0, 0], CM[1, 1]
    FP, FN = CM[0, 1], CM[1, 0]
    return FP/(FP + TN) + FN/(FN + TP)
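As a quick sanity check of this metric (a self-contained sketch that re-derives the score from the confusion matrix): a perfect classifier reaches a score of 1, while a constant classifier scores 0 even though its accuracy is 50% on balanced data, which is exactly why accuracy alone would be misleading here.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def score(y_true, y_pred):
    """1 - (FPR + FNR), as defined for the challenge."""
    TN, FP, FN, TP = confusion_matrix(y_true, y_pred).ravel()
    return 1 - (FP / (FP + TN) + FN / (FN + TP))

y_true = np.array([0, 0, 1, 1])
print(score(y_true, np.array([0, 0, 1, 1])))  # 1.0 : perfect classifier
print(score(y_true, np.array([1, 1, 1, 1])))  # 0.0 : constant classifier, accuracy 0.5
```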
X_train = X_df_cleaned[best_selected_features]
y_train = y_df_cleaned
train = pd.concat([X_train, y_train], axis=1)
train
X_train_raw = X_raw_df_cleaned[best_selected_features]
y_train = y_df_cleaned
train_raw = pd.concat([X_train_raw, y_train], axis=1)
train_raw
The implementation of the machine learning algorithms will always follow the same steps. First, the GridSearchCV function is used to choose the parameters of the algorithm that maximize the score $1-(FPR+FNR)$: I pass to its scoring parameter the criterion function defined previously, wrapped with the make_scorer function of the sklearn library. The model is then retrained with the best parameters found by GridSearchCV. The machine learning approach I adopted follows the workflow below.
Decision Trees are widely used to solve classification problems.
They use a hierarchical representation of the data structure in the form of sequences of decisions (tests) to predict a class.
Each individual (or observation), which is to be assigned to a class, is described by a set of variables that are tested in the nodes of the tree. Tests are performed in the inner nodes and decisions are made in the leaf nodes.
Each leaf node represents the output variable $y$. The internal nodes and the root node are the input variables.
Its construction is based on a recursive partitioning of the individuals using the data. This partitioning is done by a succession of split nodes; the splitting of a node is governed by cut-off conditions and stopping rules.
To determine a plausible value of $Y$ for an individual whose values $X_{1}, \ldots, X_{p}$ are known, we proceed step by step as follows. Starting from the root, at each node, we check whether the cut-off condition is verified: if it is, we follow the branch associated with the answer "Yes" (answering the implicit question "Is the condition verified?"); otherwise, we follow the branch associated with the answer "No".
For the separation of a node, the algorithm uses metrics such as the Gini index (the most used metric) or entropy. For example, with the Gini index, by separating 1 node into 2 child nodes, we seek to obtain the greatest increase in purity. The Gini index measures impurity. The Gini criterion organizes the separation of the leaves of a tree by focusing on the most represented class in the dataset: it must be separated as quickly as possible.
Gini Index
$I=1-\sum_{i=1}^{n} f_{i}^{2}$
With: $f_{i}$ the relative frequency of class $i$ in the node, and $n$ the number of classes.
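A quick numerical illustration of the index (the `gini` helper below is a hypothetical stand-alone function, not taken from the notebook):

```python
def gini(counts):
    """Gini impurity I = 1 - sum_i f_i^2, for the class counts of one node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([50, 50]))   # 0.5 : maximally impure binary node
print(gini([100, 0]))   # 0.0 : pure node, nothing left to separate
```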
Stop criterions
Once the tree is built, the number of leaves is sometimes too large. The model must be simplified by pruning the tree to the right depth. A good pruning corresponds to the right compromise between tree complexity and prediction accuracy.
A tree that is too deep = high complexity => high variance => risk of overfitting => weakened generalization power.
A tree that is not deep enough = too low complexity => high bias => risk of underfitting.
The most important parameter for a Decision Tree is the max_depth parameter. It represents the maximum depth of the tree, i.e. the maximum number of successive splits from the root to a leaf. A good practice for choosing this parameter is to start with a shallow depth of 2, for example, and increment this value by 1 without exceeding 7. To sum up, I optimize the tree depth and the node separation criterion via a GridSearchCV.
# build sample of train data
temp_data = train_raw.sample(n=int(round(X_train.shape[0] * 0.5,0)), random_state=230)
X_train_reduce = temp_data.loc[:,best_selected_features]
y_train_reduce = temp_data.loc[:,["label_0"]]
# split into X_train, y_train, X_valid and y_valid
dec_tree_X_train, dec_tree_X_valid, dec_tree_y_train, dec_tree_y_valid = train_test_split(X_train_reduce, y_train_reduce, test_size=0.2)
%%time
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
param_grid = {
'ccp_alpha': [0.1, .01, .001], # cost-complexity pruning parameter
'max_depth' : [1, 2, 3, 4, 5, 6],
'criterion' :['gini', 'entropy']
}
# run grid search
tree_clas = DecisionTreeClassifier(random_state=1024)
grid_search = GridSearchCV(estimator=tree_clas, param_grid=param_grid, cv=cv, scoring=scoring_critetion, n_jobs = -1, verbose=True)
grid_search.fit(dec_tree_X_train, dec_tree_y_train)
# print best model
print(grid_search.best_estimator_)
# compute FPR + FNR score
dec_tree_y_pred = grid_search.predict(dec_tree_X_valid)
valid_score = criterion(dec_tree_y_pred, dec_tree_y_valid)
print('FPR + FNR = {}'.format(valid_score))
# to further inspect the performance:
CM = confusion_matrix(dec_tree_y_valid, dec_tree_y_pred)
TN, TP = CM[0, 0], CM[1, 1]
FP, FN = CM[0, 1], CM[1, 0]
print('Confusion Matrix: \n {}'.format(CM))
print('Accuracy: {}'.format((TP + TN) / (TP + TN + FP + FN)))
print('False Positive Rate: {}'.format(FP / (FP + TN)))
print('False Negative Rate: {}'.format(FN / (FN + TP)))
print('FPR + FNR = {}'.format(FP / (FP + TN) + FN / (FN + TP)))
plt.figure(figsize=(6,4))
plt.grid()
dec_y_prob = grid_search.predict_proba(dec_tree_X_valid)[:, 1]
fpr, tpr, thresholds = roc_curve(dec_tree_y_valid, dec_y_prob, pos_label=1)
idx = np.argmin(fpr + (1-tpr))
plt.plot(fpr, 1-tpr, label='Decision Tree')
plt.plot(fpr[idx], (1-tpr)[idx], '+', color='k')
plt.legend(loc='best')
plt.xlabel('FPR')
plt.ylabel('FNR')
plt.show()
The sum $FPR + FNR$ is quite high at 0.53, even if the accuracy is not that bad; the model is only slightly better than a random classifier, for which $FPR + FNR = 1$. We also notice from the confusion matrix that false negatives clearly outnumber false positives. Thus, not surprisingly, the performance obtained on the validation set is quite poor, and a more adequate approach is needed. It will be interesting to compare this score with the one obtained by a Random Forest.
To predict the class of a new data point, we compute its distance to all the other data points, and among the $K$ points with the smallest distance we look at the majority class (the most represented class). This is called a majority vote.
k-NN does not compute any predictive model and it fits in the framework of Lazy Learning because it manipulates already classified individuals for any new classification.
The loss function
Minimization of the distance ($r_{1}(x)$ : the nearest neighbor index)
$r_{1}(\mathbf{x})=i^{*} \quad$ if and only if $\quad d_{i^{*}}(\mathbf{x})=\min _{1 \leq i \leq n} d_{i}(\mathbf{x})$
The decision to classify point x is made by a Majority Vote:
$\hat{f}_{k}(\mathbf{x}) \in \underset{y \in \mathcal{Y}}{\arg \max }\left(\sum_{j=1}^{k} \mathbb{1}_{\left\{y_{r_{j}}=y\right\}}\right)$
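The two formulas above can be sketched directly in a few lines. This is an illustrative from-scratch version (the notebook itself uses sklearn's KNeighborsClassifier):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    d = np.linalg.norm(X_train - x, axis=1)  # Euclidean distances d_i(x)
    nearest = np.argsort(d)[:k]              # indices of the k closest points
    votes = y_train[nearest]
    return np.bincount(votes).argmax()       # majority vote over the labels

X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.05, 0.1])))  # 0 : two of the three nearest are class 0
```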
k-NN stores the whole dataset in memory to perform a prediction.
To calculate the distance between an unclassified point and the classified data points, several metrics are available, such as:
Without any doubt, the performance of the algorithm depends strongly on the choice of the K parameter. However, it is important to find the right compromise between :
A good practice is to start the training with a small number of neighbors $k$ and then increase this value at each iteration of the algorithm.
To sum up, I optimise the number of neighbours k with a grid search. I run the grid on a small subset of the train set because k-NN quickly becomes intractable at large scale. Also, as k-NN is sensitive to the scale of the variables, I standardize the features, making it possible to use a regular Euclidean distance for the algorithm.
# build sample of train data
temp_data = train.sample(n=int(round(X_train.shape[0] * 0.02,0)), random_state=230)
X_train_reduce = temp_data.loc[:,best_selected_features]
y_train_reduce = temp_data.loc[:,["label_0"]]
# split into X_train, y_train, X_valid and y_valid
knn_X_train, knn_X_valid, knn_y_train, knn_y_valid = train_test_split(X_train_reduce, y_train_reduce, test_size=0.2)
%%time
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
max_k = 60 # upper bound on the number of neighbours to try
# define grid
grid = {
'n_neighbors':list(range(1, max_k)),
}
# run grid search
knn = KNeighborsClassifier()
grid_search = GridSearchCV(knn, grid, cv=cv, scoring=scoring_critetion, n_jobs = -1, return_train_score = True);
grid_search.fit(knn_X_train,knn_y_train);
# print best model
print(grid_search.best_estimator_)
# plot the accuracy values given the value of k
fig_1 = plt.figure(figsize=(8,6))
plt.plot(list(range(1, max_k)), grid_search.cv_results_['mean_test_score'])
plt.xlim([0, max_k])
plt.xlabel("k")
plt.ylabel("1 - (FPR + FNR)")
plt.title('Scoring Criterion on validation set for different values of k')
plt.grid(True)
plt.show()
# print outcome
best_k = grid_search.best_estimator_.n_neighbors
print("The optimal value for k is " + str(best_k) +\
", which corresponds to a Scoring Criterion of " + str(grid_search.best_score_) + " on the validation set.")
# compute FPR + FNR score
knn_y_pred = grid_search.predict(knn_X_valid)
valid_score = criterion(knn_y_pred, knn_y_valid)
print('FPR + FNR = {}'.format(valid_score))
# to further inspect the performance:
CM = confusion_matrix(knn_y_valid, knn_y_pred)
TN, TP = CM[0, 0], CM[1, 1]
FP, FN = CM[0, 1], CM[1, 0]
print('Confusion Matrix: \n {}'.format(CM))
print('Accuracy: {}'.format((TP + TN) / (TP + TN + FP + FN)))
print('False Positive Rate: {}'.format(FP / (FP + TN)))
print('False Negative Rate: {}'.format(FN / (FN + TP)))
print('FPR + FNR = {}'.format(FP / (FP + TN) + FN / (FN + TP)))
plt.figure(figsize=(6,4))
plt.grid()
knn_y_prob = grid_search.predict_proba(knn_X_valid)[:, 1]
fpr, tpr, thresholds = roc_curve(knn_y_valid, knn_y_prob, pos_label=1)
idx = np.argmin(fpr + (1-tpr))
plt.plot(fpr, 1-tpr, label='k-NN')
plt.plot(fpr[idx], (1-tpr)[idx], '+', color='k')
plt.legend(loc='best')
plt.xlabel('FPR')
plt.ylabel('FNR')
plt.show()
The sum $FPR + FNR$ is quite high at 0.51, barely better than a random classifier (for which $FPR + FNR = 1$). We also notice from the confusion matrix that the numbers of false positives and false negatives are almost identical. Thus, not surprisingly, the performance obtained on the validation set is quite poor. However, this is mainly due to the fact that the algorithm is trained on a tiny dataset, and partly to the fact that k-NN does not produce accurate scores, only approximate probabilities based on the $k$ nearest neighbors. A more adequate approach is needed.
The Support Vector Machine (SVM) can be used for both classification and regression challenges. For a classification problem, the SVM attempts to find the hyperplane that best separates the two classes. The concept of a frontier implies that the data are linearly separable; to achieve this, SVMs use kernels, i.e. mathematical functions that project the data into a vector space where they can be separated. The separation boundary is chosen as the one that maximizes the margin: maximizing the distance between the closest data point (of either class) and the hyperplane helps us choose the right hyperplane. This distance is called the margin, and it allows us to be tolerant of small variations.
The margin is the distance between the hyperplane and the closest samples. The points located on the margins are called the support vectors.
In the case where the data are not linearly separable, the SVM transforms the representation space of the input data into a higher dimensional space in which a linear separation is likely to exist. This is achieved by kernel functions.
To separate the data, SVM consider a triplet of hyperplanes:
We call geometric margin $\rho(w)$ the smallest distance between the data and the hyperplane $H$, here half the distance between $H_{1}$ and $H_{-1}$. A simple calculation gives: $\rho(w)=\frac{1}{\|w\|}$.
The goal is then simply:
Optimization in the primal space
$\underset{w, b}{\operatorname{min}} \quad \frac{1}{2}\|w\|^{2}$ under the constraint $1-y_{i}\left(w^{\top} x_{i}+b\right) \leq 0, i=1, \ldots , n$.
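In practice the data are rarely exactly separable, so the soft-margin form of this problem is the one actually optimised; the regularisation value C tuned in the grid search is precisely the weight of the slack penalty:

$\underset{w, b, \xi}{\operatorname{min}} \quad \frac{1}{2}\|w\|^{2}+C \sum_{i=1}^{n} \xi_{i}$ under the constraints $y_{i}\left(w^{\top} x_{i}+b\right) \geq 1-\xi_{i}$ and $\xi_{i} \geq 0$, $i=1, \ldots, n$.

A small C tolerates margin violations (smoother boundary), while a large C penalises them heavily (tighter fit).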
To be efficient, SVMs require some tuning. I proceed by grid search, as for the k-NN model. The parameters considered are the regularisation value C, the kernel (linear, Gaussian, polynomial or sigmoid), and the hyperparameters gamma and degree associated with the kernels.
Because the grid search takes a lot of time and the cost of fitting an SVM grows at least quadratically with the number of samples, I use only 1% of the train dataset for the grid search and the fit.
Lastly, as for k-NN, SVM is sensitive to the scale of the variables, so I standardize the features, making it possible to use a regular Euclidean distance in the algorithm.
# build sample of train data
temp_data = train.sample(n=int(round(X_train.shape[0] * 0.01,0)), random_state=340)
X_train_reduce = temp_data.loc[:,best_selected_features]
y_train_reduce = temp_data.loc[:,["label_0"]]
# split into X_train, y_train, X_valid and y_valid
svm_X_train, svm_X_valid, svm_y_train, svm_y_valid = train_test_split(X_train_reduce, y_train_reduce, test_size=0.2)
%%time
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = {
'C':[1, 1.5, 2, 2.5, 3],
'kernel':('linear', 'poly', 'rbf', 'sigmoid'),
'gamma':['scale', "auto", 0.01, 0.1],
'degree': [2, 3, 4],
}
# run grid search
svm = SVC();
grid_search = GridSearchCV(svm, grid, cv=cv, scoring=scoring_critetion, n_jobs = -1, verbose = True);
grid_search.fit(svm_X_train,svm_y_train);
# print best model
print(grid_search.best_estimator_)
# compute FPR + FNR score
svm_y_pred = grid_search.predict(svm_X_valid)
valid_score = criterion(svm_y_pred, svm_y_valid)
print('FPR + FNR = {}'.format(valid_score))
# to further inspect the performance:
CM = confusion_matrix(svm_y_valid, svm_y_pred)
TN, TP = CM[0, 0], CM[1, 1]
FP, FN = CM[0, 1], CM[1, 0]
print('Confusion Matrix: \n {}'.format(CM))
print('Accuracy: {}'.format((TP + TN) / (TP + TN + FP + FN)))
print('False Positive Rate: {}'.format(FP / (FP + TN)))
print('False Negative Rate: {}'.format(FN / (FN + TP)))
print('FPR + FNR = {}'.format(FP / (FP + TN) + FN / (FN + TP)))
First of all, before discussing the result, note that the search for the best hyperparameters takes a lot of time (approx. 12 min), even though the model is trained on only 1% of the training dataset.
The sum $FPR + FNR$ remains high at 0.46, but this is the best score obtained so far, slightly below the rate reached by k-NN. As with the Decision Tree, the confusion matrix shows that false positives outnumber false negatives. Thus, the performance obtained on the validation set is not that bad, even though the algorithm is trained on a tiny dataset.
Bagging is a technique that consists in assembling a large number of algorithms with low individual performance (shallow Decision Trees) to create a much more efficient one (Random Forest). The low performance algorithms are called "weak learners" and the result "strong learner".
Weak learners can be of different kinds and have different performances, but they must be independent of each other.
The assembly of "weak learners" (shrubs) into "strong learner" (forest) is done by voting. That is to say that each "weak learner" will emit an answer (a vote), and the prediction of the "strong learner" will be the average of all the emitted answers.
In fact, bagging combines the "best" classifiers in a way that reduces their variance. Bagging is used with decision trees, where it greatly increases the stability of the models by improving accuracy and reducing variance, thus eliminating the challenge of overfitting.
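The variance-reduction effect of bagging can be checked on synthetic data. This is a minimal sketch with made-up data, not the challenge dataset:

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # diagonal boundary, hard for one shallow tree

tree = DecisionTreeClassifier(max_depth=3, random_state=0)        # one weak learner
bag = BaggingClassifier(tree, n_estimators=100, random_state=0)   # vote of 100 of them

tree_acc = cross_val_score(tree, X, y, cv=5).mean()
bag_acc = cross_val_score(bag, X, y, cv=5).mean()
print(tree_acc, bag_acc)
```

On runs like this, the bagged ensemble typically matches or improves on the single tree, because averaging the bootstrapped trees smooths out their individual errors.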
First of all, Random Forests represent a class of machine learning algorithms with solid performances in the family of ensemble learning.
Random forests are therefore an improvement of bagging for Decision Trees, designed to make the trees more independent (less correlated). A random forest is composed of several decision trees working independently on a classification task. Each one produces an estimate, and it is the assembly of the decision trees and their analyses that gives the global estimate. In other words, it is a matter of drawing on different opinions about the same problem to understand it better. Each tree is trained on a random subset of the data.
The term random forest comes from the fact that the individual predictors are trees, and that each tree depends on an additional random variable (i.e. in addition to the training sample $L_{n}$). A random forest is the aggregation of a collection of random trees.
The decision to classify point x is made by a Majority Vote:
$\hat{f}_{k}(\mathbf{x}) \in \underset{y \in \mathcal{Y}}{\arg \max }\left(\sum_{j=1}^{k} \mathbb{1}_{\left\{y_{r_{j}}=y\right\}}\right)$
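The majority vote above can be illustrated with a toy example (the per-tree votes below are made up):

```python
import numpy as np

# hypothetical votes of k=5 trees for one point x; each entry is the
# class predicted by one tree
tree_votes = np.array([1, 0, 1, 1, 0])

# count the votes per class, then take the argmax, matching the formula
counts = np.bincount(tree_votes, minlength=2)  # [2, 3]
prediction = int(np.argmax(counts))
print(prediction)  # class 1 wins, 3 votes to 2
```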
Same as Decision Tree
Same as Decision Tree
I optimize my model with a grid search on the following parameters: the split criterion (Gini or entropy) and the maximum tree depth (from 2 to 7). I deliberately limit the tree depth to avoid overfitting. In addition, I also optimize the number of estimators, i.e. the number of Decision Trees used to make the prediction. Overall, I optimize the parameters with a grid search over the same parameters as those defined for the Decision Tree implementation.
# build sample of train data
temp_data = train_raw.sample(n=int(round(X_train.shape[0] * 0.05,0)), random_state=140)
X_train_reduce = temp_data.loc[:,best_selected_features]
y_train_reduce = temp_data.loc[:,["label_0"]]
# split into X_train, y_train, X_valid and y_valid
rf_X_train, rf_X_valid, rf_y_train, rf_y_valid = train_test_split(X_train_reduce, y_train_reduce, test_size=0.2)
%%time
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = {'criterion':('gini', 'entropy'),
'n_estimators': [100, 110, 120],
'max_depth':[2, 3, 4, 5, 6, 7],
}
# run grid search
rf = RandomForestClassifier()
grid_search = GridSearchCV(rf, grid, cv=cv, scoring=scoring_critetion, n_jobs=-1)
grid_search.fit(rf_X_train, rf_y_train)
# print best model
print(grid_search.best_estimator_)
# compute FPR + FNR score
rf_y_pred = grid_search.predict(rf_X_valid)
valid_score = criterion(rf_y_pred, rf_y_valid)
print('FPR + FNR = {}'.format(valid_score))
# to further inspect the performance:
CM = confusion_matrix(rf_y_valid, rf_y_pred)
TN, TP = CM[0, 0], CM[1, 1]
FP, FN = CM[0, 1], CM[1, 0]
print('Confusion Matrix: \n {}'.format(CM))
print('Accuracy: {}'.format((TP + TN) / (TP + TN + FP + FN)))
print('False Positive Rate: {}'.format(FP / (FP + TN)))
print('False Negative Rate: {}'.format(FN / (FN + TP)))
print('FPR + FNR = {}'.format(FP / (FP + TN) + FN / (FN + TP)))
plt.figure(figsize=(6,4))
plt.grid()
rf_y_prob = grid_search.predict_proba(rf_X_valid)[:, 1]
fpr, tpr, thresholds = roc_curve(rf_y_valid, rf_y_prob, pos_label=1)
idx = np.argmin(fpr + (1-tpr))
plt.plot(fpr, 1-tpr, label='RF')
plt.plot(fpr[idx], (1-tpr)[idx], '+', color='k')
plt.legend(loc='best')
plt.xlabel('FPR')
plt.ylabel('FNR')
plt.show()
The $FPR + FNR$ rate is quite high: 0.49. The algorithm does about as well as random. Still, Random Forest performs better on this classification task than the Decision Tree (0.53). We also notice from the confusion matrix that the numbers of false positives and false negatives are almost equal. Thus, not surprisingly, the performance obtained on the validation set is quite poor.
ExtraTrees and Random Forests are close methodologies; the two algorithms have much in common. Both are composed of a large number of decision trees, the final decision being obtained by a majority vote over the per-tree predictions.
However, there are two differences:
- Random Forest uses bootstrap replicas, i.e. it subsamples the input data with replacement, while Extra Trees uses the entire original sample.
- Extra Trees adds extra randomization in the choice of split thresholds but still optimizes over the candidate splits.
I optimize my model with a grid search on the following parameters: the split criterion (Gini or entropy) and the maximum tree depth (from 1 to 7). I deliberately limit the tree depth to avoid overfitting. Since Extra Trees and Random Forests adopt very similar approaches, I reuse the same optimization strategy as for the random forest.
# build sample of train data
temp_data = train_raw.sample(n=int(round(X_train.shape[0] * 0.1,0)), random_state=140)
X_train_reduce = temp_data.loc[:,best_selected_features]
y_train_reduce = temp_data.loc[:,["label_0"]]
# split into X_train, y_train, X_valid and y_valid
etr_X_train, etr_X_valid, etr_y_train, etr_y_valid = train_test_split(X_train_reduce, y_train_reduce, test_size=0.2)
%%time
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = {'criterion':('gini', 'entropy'),
'n_estimators': [100, 110, 120],
'max_depth':[2, 3, 4, 5, 6, 7],
}
# run grid search
etr = ExtraTreesClassifier()
grid_search = GridSearchCV(etr, grid, cv=cv, scoring=scoring_critetion, n_jobs=-1)
grid_search.fit(etr_X_train, etr_y_train)
# print best model
print(grid_search.best_estimator_)
etr_y_pred = grid_search.predict(etr_X_valid)
valid_score = criterion(etr_y_pred, etr_y_valid)
print('FPR + FNR = {}'.format(valid_score))
# to further inspect the performance:
CM = confusion_matrix(etr_y_valid, etr_y_pred)
TN, TP = CM[0, 0], CM[1, 1]
FP, FN = CM[0, 1], CM[1, 0]
print('Confusion Matrix: \n {}'.format(CM))
print('Accuracy: {}'.format((TP + TN) / (TP + TN + FP + FN)))
print('False Positive Rate: {}'.format(FP / (FP + TN)))
print('False Negative Rate: {}'.format(FN / (FN + TP)))
print('FPR + FNR = {}'.format(FP / (FP + TN) + FN / (FN + TP)))
plt.figure(figsize=(6,4))
plt.grid()
etr_y_prob = grid_search.predict_proba(etr_X_valid)[:, 1]
fpr, tpr, thresholds = roc_curve(etr_y_valid, etr_y_prob, pos_label=1)
idx = np.argmin(fpr + (1-tpr))
plt.plot(fpr, 1-tpr, label='ExtraTrees')
plt.plot(fpr[idx], (1-tpr)[idx], '+', color='k')
plt.legend(loc='best')
plt.xlabel('FPR')
plt.ylabel('FNR')
plt.show()
The $FPR + FNR$ rate is quite high: 0.55. The algorithm does worse than random, and Random Forest (0.49) performs better on the classification task than ExtraTrees. We also notice from the confusion matrix that false positives greatly outnumber false negatives. Thus, not surprisingly, the performance obtained on the validation set is quite poor.
The principle of boosting is to combine the outputs of several weak classifiers to obtain a much more accurate prediction (strong classifier).
The boosting method is used to decrease the bias. Each weak classifier is weighted by the quality of its classification: the better it classifies, the more weight it gets. Poorly classified examples are given a greater weight (we say they are boosted) for the weak learner of the next round, so that it compensates for the previous mistakes.
After the first tree is fitted, the algorithm increases the weight of each observation that the model fails to classify correctly, and decreases the weight of those whose classification poses no problem. The original idea is to improve on the predictions of the first tree: the goal is to correct the shortcomings of the previous tree.
The "weak learners" of AdaBoost are generally decision trees with only 2 branches and 2 leaves (also called stumps) but we can use other types of classifiers.
Here are the steps to build the first "weak learner" that we will call $w_{1}$ :
The score will allow us to determine the weight to be given to each "weak learner" at the time of the final vote.
Moreover, we want the next weak learner to be able to correct the mistakes of the previous one. To do this, we will increase the weight of the lines on which the first weak learner was wrong, and decrease those on which the first weak learner was right. Here are the steps to build the other "weak learner":
Contrary to $w_{1}$, the next "weak learner" will take into account the weights assigned to the lines. The higher the weight of a line, the more important it is for the weak learner to classify this line correctly, and inversely.
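The weight-update scheme described above can be sketched numerically (the `correct` mask below is a hypothetical result of the first weak learner $w_{1}$; this is the discrete two-class AdaBoost update, not the notebook's pipeline):

```python
import numpy as np

n = 5
w = np.full(n, 1 / n)                                 # initial uniform weights
correct = np.array([True, True, False, True, False])  # hypothetical w1 results

# weighted error of w1 and its weight (score) in the final vote
err = w[~correct].sum()                   # 0.4
alpha = 0.5 * np.log((1 - err) / err)     # higher when w1 classifies well

# boost the misclassified rows, down-weight the correct ones, renormalize
w = w * np.exp(np.where(correct, -alpha, alpha))
w = w / w.sum()
```

After normalization the two misclassified rows carry half of the total weight, so the next weak learner is pushed to classify them correctly.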
To apply the algorithm, I chose to use a Decision Tree classifier.
I optimized the parameters of the algorithm by varying the number of estimators (i.e. the number of weak learners, here Decision Trees) as well as the learning rate. In addition, I started from the best hyperparameters selected by the grid search of my Decision Tree algorithm. But after iterating the model several times, I observed that a smaller tree depth generally led to better performance, so I set the depth of the decision tree to 2 (max_depth=2).
# build sample of train data
temp_data = train_raw.sample(n=int(round(X_train.shape[0] * 0.05,0)), random_state=140)
X_train_reduce = temp_data.loc[:,best_selected_features]
y_train_reduce = temp_data.loc[:,["label_0"]]
# split into X_train, y_train, X_valid and y_valid
adb_tree_X_train, adb_tree_X_valid, adb_tree_y_train, adb_tree_y_valid = train_test_split(X_train_reduce, y_train_reduce, test_size=0.2)
%%time
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = {'learning_rate': [0.01, 0.1, 1.0],
'n_estimators': [100, 110, 120],
}
# run grid search
adb_tree = AdaBoostClassifier(DecisionTreeClassifier(ccp_alpha=0.001, criterion='entropy', max_depth=2), random_state=0)
grid_search = GridSearchCV(adb_tree, grid, cv=cv, scoring=scoring_critetion, n_jobs=-1)
# fit on train set
grid_search.fit(adb_tree_X_train, adb_tree_y_train)
# print best model
print(grid_search.best_estimator_)
adb_tree_y_pred = grid_search.predict(adb_tree_X_valid)
valid_score = criterion(adb_tree_y_pred, adb_tree_y_valid)
print('FPR + FNR = {}'.format(valid_score))
# to further inspect the performance:
CM = confusion_matrix(adb_tree_y_valid, adb_tree_y_pred)
TN, TP = CM[0, 0], CM[1, 1]
FP, FN = CM[0, 1], CM[1, 0]
print('Confusion Matrix: \n {}'.format(CM))
print('Accuracy: {}'.format((TP + TN) / (TP + TN + FP + FN)))
print('False Positive Rate: {}'.format(FP / (FP + TN)))
print('False Negative Rate: {}'.format(FN / (FN + TP)))
print('FPR + FNR = {}'.format(FP / (FP + TN) + FN / (FN + TP)))
plt.figure(figsize=(6,4))
plt.grid()
adb_tree_y_prob = grid_search.predict_proba(adb_tree_X_valid)[:, 1]
fpr, tpr, thresholds = roc_curve(adb_tree_y_valid, adb_tree_y_prob, pos_label=1)
idx = np.argmin(fpr + (1-tpr))
plt.plot(fpr, 1-tpr, label='AdaBoost')
plt.plot(fpr[idx], (1-tpr)[idx], '+', color='k')
plt.legend(loc='best')
plt.xlabel('FPR')
plt.ylabel('FNR')
plt.show()
With the second approach I managed to get the $FPR + FNR$ rate down to 0.51. Not surprisingly, the performance obtained on the validation set is better because this time I used more of the dataset to train my model. Increasing the number of estimators also seems to improve the performance. Surprisingly, the boosting algorithm does not perform much better than the bagging algorithms or than classical approaches such as SVM and k-NN, but it is still more efficient than the Decision Tree.
The Gradient Boosting algorithm has a lot in common with AdaBoost. Like AdaBoost, it is a set of weak learners, created one after the other, forming a strong learner, and each weak learner is trained to correct the mistakes of the previous ones. However, unlike AdaBoost, all weak learners have equal weight in the voting system, regardless of their performance.
The first weak learner ($w_{1}$) is very basic, it is simply the average of the observations. It is therefore not very efficient, but it will serve as a basis for the rest of the algorithm.
Afterwards, we compute the difference between this average and the true value, which we call the first residual. In general, we call residual the difference between the prediction of the algorithm and the expected value.
The particularity of Gradient Boosting is that it tries to predict at each step not the data itself but the residues.
Thus, the second "weak learner" is trained to predict the first residual.
The predictions of the second weak learner are then multiplied by a factor less than 1.
The idea behind this multiplication is that several small steps are more accurate than a few large steps. The multiplication therefore reduces the size of the "steps" to increase the accuracy. The objective is to "move" the predictions of the model away from the mean, little by little, to bring them closer to reality. From this moment, the creation of the weak learners always follows the same pattern:
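The mean-plus-residuals scheme above can be sketched with a toy regression loop (synthetic data; the learning rate of 0.3 and depth of 2 are illustrative values, not the notebook's tuned parameters):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# made-up 1-D regression problem
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

lr = 0.3                                 # shrinks each "step"
pred = np.full_like(y, y.mean())         # w1: simply the average
for _ in range(50):
    residual = y - pred                  # what remains to be explained
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += lr * tree.predict(X)         # small step toward the residuals

# the boosted predictions beat the plain mean
print(np.mean((y - pred) ** 2) < np.mean((y - y.mean()) ** 2))  # True
```

Each tree predicts the current residuals, and the learning rate keeps every step small, exactly the "many small steps" idea described above.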
I follow a usual grid search approach for the main parameters of interest: the learning rate, the maximum depth, and the maximum number of features retained.
# build sample of train data
temp_data = train_raw.sample(n=int(round(X_train.shape[0] * 0.05,0)), random_state=140)
X_train_reduce = temp_data.loc[:,best_selected_features]
y_train_reduce = temp_data.loc[:,["label_0"]]
# split into X_train, y_train, X_valid and y_valid
gb_X_train, gb_X_valid, gb_y_train, gb_y_valid = train_test_split(X_train_reduce, y_train_reduce, test_size=0.2)
%%time
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = {'learning_rate':[0.2, 0.3],
'max_depth':[1, 2, 3, 4, 6],
'max_features':['sqrt', "log2", 0.2]}
# run grid search
gb = GradientBoostingClassifier()
grid_search = GridSearchCV(gb, grid, cv=cv, scoring=criterion_GridCV, n_jobs=-1)
grid_search.fit(gb_X_train, gb_y_train)
print(grid_search.best_estimator_)
gb_y_pred = grid_search.predict(gb_X_valid)
valid_score = criterion(gb_y_pred, gb_y_valid)
print('FPR + FNR = {}'.format(valid_score))
# to further inspect the performance:
CM = confusion_matrix(gb_y_valid, gb_y_pred)
TN, TP = CM[0, 0], CM[1, 1]
FP, FN = CM[0, 1], CM[1, 0]
print('Confusion Matrix: \n {}'.format(CM))
print('Accuracy: {}'.format((TP + TN) / (TP + TN + FP + FN)))
print('False Positive Rate: {}'.format(FP / (FP + TN)))
print('False Negative Rate: {}'.format(FN / (FN + TP)))
print('FPR + FNR = {}'.format(FP / (FP + TN) + FN / (FN + TP)))
plt.figure(figsize=(6,4))
plt.grid()
gb_y_prob = grid_search.predict_proba(gb_X_valid)[:, 1]
fpr, tpr, thresholds = roc_curve(gb_y_valid, gb_y_prob, pos_label=1)
idx = np.argmin(fpr + (1-tpr))
plt.plot(fpr, 1-tpr, label='GradientBoosting')
plt.plot(fpr[idx], (1-tpr)[idx], '+', color='k')
plt.legend(loc='best')
plt.xlabel('FPR')
plt.ylabel('FNR')
plt.show()
The $FPR + FNR$ rate is quite high: 0.54. The algorithm does worse than random.
XGBoost is an improved version of the Gradient Boosting algorithm. Indeed, it relies on a set of "weak learners" that predict the residuals and correct the errors of the previous "weak learners".
The main difference between XGBoost and other implementations of the Gradient Boosting method is that XGBoost is computationally optimized to make the various computations required to apply Gradient Boosting faster. Specifically, XGBoost processes data in multiple compressed blocks allowing for much faster sorting and parallel processing.
But the advantages of XGBoost are not only linked to the implementation of the algorithm, and thus to its speed, but also to the parameters it offers. XGBoost exposes a very rich panel of hyperparameters; this diversity makes it possible to keep full control over the implementation of Gradient Boosting. It is also possible to add different regularizations to the loss function, limiting a phenomenon that happens quite often with gradient boosting algorithms: overfitting.
For this algorithm, I decided to find the best parameters manually, in order to keep control over the refinement of the model, which is not the case with a grid search.
The strategy I have implemented is as follows:
1/ First, I initialize the hyperparameters of the model by filling in reasonable values for the key inputs:
- learning_rate: 0.01
- n_estimators: 100, because I train the model on the entire train dataset
- max_depth: 3
- subsample: 0.8
- colsample_bytree: 1
- gamma: 1
- objective='binary:logistic': I use logistic regression for binary classification as the objective function, because it is the most suitable for the binary classification task of this data challenge
2/ Run model.fit(eval_set, eval_metric) and diagnose the first run, specifically the n_estimators parameter. After several iterations I noticed that beyond 103 estimators the performance drops, so I set this parameter to 103 before varying the other important XGBoost parameters and analyzing their influence on the performance of the model.
3/ Optimize the max_depth parameter, which represents the maximum depth of each tree. To find its best value, I started from a low max_depth (3 for instance), increased it by 1 at a time, and stopped when increasing it brought no performance gain. This parameter must be handled with care: if it is too high, it can lead to overfitting.
4/ Try different values of the learning rate and of the parameters that help avoid overfitting:
- learning_rate: a lower learning rate can increase the prediction performance but increases the training time. I chose a learning rate of 0.3 because, after several iterations of the algorithm, it offered the best performance/training-time compromise.
- subsample: for each tree, the percentage of rows taken to build it. I keep the default of 1, because removing too many rows makes the performance drop a lot.
- colsample_bytree: the fraction of columns used by each tree. I keep it at 1 so that each tree can use all the features of my dataset.
- gamma: acts as a regularization parameter. I keep the default value because, in my case, changing it did not influence the performance.
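As a sketch of step 2/ (finding where extra estimators stop helping), scikit-learn's GradientBoostingClassifier offers a built-in analogue of monitoring an eval_set: training halts once the internal validation score has stopped improving. The data and thresholds below are made up for illustration; the notebook itself diagnoses XGBoost by hand.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# synthetic stand-in for the challenge data
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=500,          # upper bound, rarely reached
    learning_rate=0.1,
    max_depth=3,
    validation_fraction=0.2,   # internal hold-out used for monitoring
    n_iter_no_change=10,       # stop after 10 rounds without improvement
    random_state=0,
)
gb.fit(X, y)
print(gb.n_estimators_)        # number of trees actually built
```

This automates the "beyond N estimators the performance drops" diagnosis instead of re-running the fit by hand for each candidate value.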
# build sample of train data
temp_data = train_raw.sample(n=int(round(X_train.shape[0] * 1,0)), random_state=140)
X_train_reduce = temp_data.loc[:,best_selected_features]
y_train_reduce = temp_data.loc[:,["label_0"]]
# split into X_train, y_train, X_valid and y_valid
xgbc_X_train, xgbc_X_valid, xgbc_y_train, xgbc_y_valid = train_test_split(X_train_reduce, y_train_reduce, test_size=0.2, random_state=122, stratify=y_df_cleaned)
%%time
# fit model on training data
xgbc = XGBClassifier(booster='gbtree', learning_rate=0.3,
max_depth=6, n_estimators=103,
colsample_bynode=1, colsample_bytree=1,
subsample=1, gamma=0, objective='binary:logistic')
xgbc.fit(xgbc_X_train, xgbc_y_train)
xgbc_y_pred = xgbc.predict(xgbc_X_valid)
valid_score = criterion(xgbc_y_pred, xgbc_y_valid)
print('FPR + FNR = {}'.format(valid_score))
# to further inspect the performance:
CM = confusion_matrix(xgbc_y_valid, xgbc_y_pred)
TN, TP = CM[0, 0], CM[1, 1]
FP, FN = CM[0, 1], CM[1, 0]
print('Confusion Matrix: \n {}'.format(CM))
print('Accuracy: {}'.format((TP + TN) / (TP + TN + FP + FN)))
print('False Positive Rate: {}'.format(FP / (FP + TN)))
print('False Negative Rate: {}'.format(FN / (FN + TP)))
print('FPR + FNR = {}'.format(FP / (FP + TN) + FN / (FN + TP)))
plt.figure(figsize=(6,4))
plt.grid()
xgbc_y_prob = xgbc.predict_proba(xgbc_X_valid)[:, 1]
fpr, tpr, thresholds = roc_curve(xgbc_y_valid, xgbc_y_prob, pos_label=1)
idx = np.argmin(fpr + (1-tpr))
plt.plot(fpr, 1-tpr, label='XGBoost')
plt.plot(fpr[idx], (1-tpr)[idx], '+', color='k')
plt.legend(loc='best')
plt.xlabel('FPR')
plt.ylabel('FNR')
plt.show()
The $FPR + FNR$ rate obtained with XGBoost is the lowest so far for this data challenge: 0.45. Not surprisingly, XGBoost gives the best performance.
The XGBoost model seems to give the best performance. The idea now is to improve on the first XGBoost approach. To do this, I want to see whether training the model on the whole training dataset gives better results, and then fine-tune the hyperparameters of the model.
Let's try to run XGBoost on the entire dataset, without restricting it to the selected features.
# split into X_train, y_train, X_valid and y_valid
xgbc_X_train, xgbc_X_valid, xgbc_y_train, xgbc_y_valid = train_test_split(X_dataframe, y_dataframe, test_size=0.2, random_state=12)
%%time
# fit model
xgbc = XGBClassifier(booster='gbtree', learning_rate=0.3,
max_depth=6, n_estimators=103,
colsample_bynode=1, colsample_bytree=1,
subsample=1, gamma=0, objective='binary:logistic')
xgbc.fit(xgbc_X_train, xgbc_y_train)
xgbc_y_pred = xgbc.predict(xgbc_X_valid)
valid_score = criterion(xgbc_y_pred, xgbc_y_valid)
print('FPR + FNR = {}'.format(valid_score))
# to further inspect the performance:
CM = confusion_matrix(xgbc_y_valid, xgbc_y_pred)
TN, TP = CM[0, 0], CM[1, 1]
FP, FN = CM[0, 1], CM[1, 0]
print('Confusion Matrix: \n {}'.format(CM))
print('Accuracy: {}'.format((TP + TN) / (TP + TN + FP + FN)))
print('False Positive Rate: {}'.format(FP / (FP + TN)))
print('False Negative Rate: {}'.format(FN / (FN + TP)))
print('FPR + FNR = {}'.format(FP / (FP + TN) + FN / (FN + TP)))
plt.figure(figsize=(6,4))
plt.grid()
xgbc_y_prob = xgbc.predict_proba(xgbc_X_valid)[:, 1]
fpr, tpr, thresholds = roc_curve(xgbc_y_valid, xgbc_y_prob, pos_label=1)
idx = np.argmin(fpr + (1-tpr))
plt.plot(fpr, 1-tpr, label='XGBoost')
plt.plot(fpr[idx], (1-tpr)[idx], '+', color='k')
plt.legend(loc='best')
plt.xlabel('FPR')
plt.ylabel('FNR')
plt.show()
The $FPR + FNR$ rate is the lowest in this notebook so far: 0.40. I managed to greatly improve the classification performance of my model by using all the features of the dataset. This is quite surprising: it means that training the model on a limited number of features reduces the performance. Therefore, the variable selection performed for this data challenge does not seem to improve the performance of the models.
As we have just seen with the implementation of the XGBoost model, training the model on the whole dataset at our disposal leads to better performance. Therefore, for the rest of this notebook, I will directly train the next models on the whole dataset.
Moreover, we could notice that it is the boosting algorithms that provide the best performances for the moment. There are other boosting algorithms which are an improvement of the XGBoost model like LightGBM and CatBoost. It is therefore necessary to test the performance of these models on our dataset
Similar to XGBoost, LightGBM, developed by Microsoft, is a high-performance distributed framework that uses decision trees for ranking, classification and regression tasks. LightGBM is significantly faster than XGBoost while delivering almost equivalent performance. The faster training speed and good accuracy come from LightGBM being a histogram-based algorithm that buckets continuous values, which also requires less memory. One of its strong points is that LightGBM handles large and complex datasets while remaining much faster to train.
In contrast to the level-wise (horizontal) growth in XGBoost, LightGBM carries out leaf-wise (vertical) growth that results in more loss reduction and in turn higher accuracy while being faster. But this may also result in overfitting on the training data which could be handled using the max-depth parameter that specifies where the splitting would occur. Hence, XGBoost is capable of building more robust models than LightGBM.
Since LightGBM is similar to XGBoost, the two models share broadly the same parameters. I therefore used the same parameter-refinement strategy as for the XGBoost model. First, I initialize the model hyperparameters by filling in reasonable values for the key inputs:
- learning_rate: 0.3
- n_estimators: 100, because I train the model on the entire train dataset
- max_depth: 3
- subsample: 0.8
- colsample_bytree: 1
- objective='binary'
I then refined the parameters by hand instead of using a grid search, because I wanted to keep total control over the optimization of the parameters (max_depth, learning_rate, n_estimators, ...) by fine-tuning each one case by case to obtain the best model for our classification task.
As a result, for the optimal choice of LightGBM hyperparameters, I used a lower learning rate than for the XGBoost model and a much larger number of estimators. This can be explained in part by the difference between XGBoost and LightGBM in growth strategy (level-wise versus leaf-wise) mentioned before.
But globally, the best parameters are very similar to those I used for XGBoost (same max_depth, colsample_bynode, colsample_bytree and subsample values). This confirms the choices made for XGBoost, since the two models are broadly based on the same mechanism.
lgbc_X_train, lgbc_X_valid, lgbc_y_train, lgbc_y_valid = train_test_split(X_dataframe, y, test_size=0.2, random_state=34)
%%time
# fit model on training data
lgbc = LGBMClassifier(objective= 'binary', learning_rate=0.1, n_estimators = 2000,
max_depth=6, colsample_bynode=1, colsample_bytree=1,
subsample=1)
lgbc.fit(lgbc_X_train, lgbc_y_train, verbose=True)
lgbc_y_pred = lgbc.predict(lgbc_X_valid)
valid_score = criterion(lgbc_y_pred, lgbc_y_valid)
print('FPR + FNR = {}'.format(valid_score))
# to further inspect the performance:
CM = confusion_matrix(lgbc_y_valid, lgbc_y_pred)
TN, TP = CM[0, 0], CM[1, 1]
FP, FN = CM[0, 1], CM[1, 0]
print('Confusion Matrix: \n {}'.format(CM))
print('Accuracy: {}'.format((TP + TN) / (TP + TN + FP + FN)))
print('False Positive Rate: {}'.format(FP / (FP + TN)))
print('False Negative Rate: {}'.format(FN / (FN + TP)))
print('FPR + FNR = {}'.format(FP / (FP + TN) + FN / (FN + TP)))
plt.figure(figsize=(6,4))
plt.grid()
lgbc_y_prob = lgbc.predict_proba(lgbc_X_valid)[:, 1]
fpr, tpr, thresholds = roc_curve(lgbc_y_valid, lgbc_y_prob, pos_label=1)
idx = np.argmin(fpr + (1-tpr))
plt.plot(fpr, 1-tpr, label='LightGBM')
plt.plot(fpr[idx], (1-tpr)[idx], '+', color='k')
plt.legend(loc='best')
plt.xlabel('FPR')
plt.ylabel('FNR')
plt.show()
The $FPR + FNR$ rate is the lowest in this notebook so far: 0.39. LightGBM outperforms XGBoost. Maybe I am simply benefiting from LightGBM's improvements, or from a better choice of hyperparameters, or both. In any case, LightGBM is so far the best model for the classification task of this data challenge.
CatBoost is an open-source machine learning (gradient boosting) algorithm, whose name comes from "Category" and "Boosting".
CatBoost builds symmetric (balanced) trees, unlike XGBoost and LightGBM. At each step, the leaves of the previous tree are split using the same condition. The feature-split pair that represents the lowest loss is selected and used for all nodes in the tier. This balanced tree architecture facilitates efficient processor implementation, reduces prediction time, makes model applicators fast, and controls overfitting as the structure serves as a regularization.
Classic boosting algorithms are prone to overfitting on small or noisy data sets due to a problem known as prediction shift. When computing the gradient estimate of a data instance, these algorithms use the same data instances with which the model was built, thus having no chance of encountering unseen data. CatBoost, on the other hand, uses the concept of ordered boosting, a permutation-based approach that trains the model on one subset of the data while computing residuals on another subset, thus preventing target leakage and overfitting.
Globally, CatBoost is based on the same principle, namely the boosting technique, but integrates new approaches that allow it to be in some cases more efficient than XGBoost and LightGBM, both in terms of prediction time and accuracy of the generated prediction.
Since CatBoost is similar to XGBoost and LightGBM, these models share broadly the same parameters. I therefore used the same parameter-refinement strategy as for the XGBoost model. First, I initialize the model hyperparameters by filling in reasonable values for the key inputs:
- learning_rate: 0.3
- iterations: 1000, equivalent to n_estimators for XGBoost and LightGBM
- subsample: 1
- eval_metric='Logloss', similar to objective='binary:logistic' for XGBoost
I then refined the parameters by hand instead of using a grid search, because I wanted to keep total control over the optimization of the parameters (learning_rate, iterations, subsample, ...) by fine-tuning each one case by case to obtain the best model for our classification task.
As a result, for the optimal choice of CatBoost hyperparameters, I used the same learning rate as for the LightGBM model, and a number of estimators much larger than for XGBoost but smaller than for LightGBM.
Moreover, I set a lower value for the subsample hyperparameter than for LightGBM and XGBoost.
catgbc_X_train, catgbc_X_valid, catgbc_y_train, catgbc_y_valid = train_test_split(X_dataframe, y, test_size=0.2, random_state=12)
%%time
# fit model on training data
catgbc = CatBoostClassifier(eval_metric= 'Logloss', iterations= 1500,
learning_rate= 0.1, subsample= 0.8)
catgbc.fit(catgbc_X_train, catgbc_y_train, verbose=True)
catgbc_y_pred = catgbc.predict(catgbc_X_valid)
valid_score = criterion(catgbc_y_pred, catgbc_y_valid)
print('FPR + FNR = {}'.format(valid_score))
# to further inspect the performance:
CM = confusion_matrix(catgbc_y_valid, catgbc_y_pred)
TN, TP = CM[0, 0], CM[1, 1]
FP, FN = CM[0, 1], CM[1, 0]
print('Confusion Matrix: \n {}'.format(CM))
print('Accuracy: {}'.format((TP + TN) / (TP + TN + FP + FN)))
print('False Positive Rate: {}'.format(FP / (FP + TN)))
print('False Negative Rate: {}'.format(FN / (FN + TP)))
print('FPR + FNR = {}'.format(FP / (FP + TN) + FN / (FN + TP)))
plt.figure(figsize=(6,4))
plt.grid()
catgbc_y_prob = catgbc.predict_proba(catgbc_X_valid)[:, 1]
fpr, tpr, thresholds = roc_curve(catgbc_y_valid, catgbc_y_prob, pos_label=1)
idx = np.argmin(fpr + (1-tpr))
plt.plot(fpr, 1-tpr, label='CatBoost')
plt.plot(fpr[idx], (1-tpr)[idx], '+', color='k')
plt.legend(loc='best')
plt.xlabel('FPR')
plt.ylabel('FNR')
plt.show()
The $FPR + FNR$ rate is the lowest in this notebook so far: 0.38. CatBoost provides better performance than XGBoost and LightGBM on the binary classification task.
The voting classifier aggregates the predicted class (hard voting) or the predicted probabilities (soft voting) of several base models. By feeding a variety of base models to the voting classifier, the errors of any single model can be compensated by the others.
To implement the Voting Classifier I use the VotingClassifier lib from scikit-learn.
Moreover, I use voting='hard', i.e. the majority-vote strategy, simply because it improves performance: the class most often predicted by the 3 boosting models is the final class predicted by the voting classifier.
I build the ensemble from the 3 best classifiers, namely XGBoost, LightGBM and CatBoost. For each classifier, I use the best hyperparameters found when implementing it independently.
voting_gbc_X_train, voting_gbc_X_valid, voting_gbc_y_train, voting_gbc_y_valid = train_test_split(X_dataframe, y, test_size=0.2, random_state=57)
%%time
estimators = [
('xgbc', XGBClassifier(booster='gbtree', learning_rate=0.3,
max_depth=6, n_estimators=103,
colsample_bynode=1, colsample_bytree=1,
subsample=1, gamma=0,
objective='binary:logistic', random_state=57)),
('lgbc', LGBMClassifier(objective= 'binary',
n_estimators = 2000, random_state=57)),
('catgbc', CatBoostClassifier(eval_metric= 'Logloss', iterations= 1500,
learning_rate= 0.1, subsample= 0.8, random_state=57))
]
voting_clf = VotingClassifier(estimators=estimators, voting='hard')
voting_clf.fit(voting_gbc_X_train, voting_gbc_y_train)
voting_clf_y_pred = voting_clf.predict(voting_gbc_X_valid)
valid_score = criterion(voting_clf_y_pred, voting_gbc_y_valid)
print('FPR + FNR = {}'.format(valid_score))
# to further inspect the performance:
CM = confusion_matrix(voting_gbc_y_valid, voting_clf_y_pred)
TN, TP = CM[0, 0], CM[1, 1]
FP, FN = CM[0, 1], CM[1, 0]
print('Confusion Matrix: \n {}'.format(CM))
print('Accuracy: {}'.format((TP + TN) / (TP + TN + FP + FN)))
print('False Positive Rate: {}'.format(FP / (FP + TN)))
print('False Negative Rate: {}'.format(FN / (FN + TP)))
print('FPR + FNR = {}'.format(FP / (FP + TN) + FN / (FN + TP)))
plt.figure(figsize=(6,4))
plt.grid()
# hard voting exposes no predict_proba, so we plot the probabilities of the CatBoost base model
gb_y_prob = catgbc.predict_proba(voting_gbc_X_valid)[:, 1]
fpr, tpr, thresholds = roc_curve(voting_gbc_y_valid, gb_y_prob, pos_label=1)
idx = np.argmin(fpr + (1-tpr))
plt.plot(fpr, 1-tpr, label='CatBoost')
plt.plot(fpr[idx], (1-tpr)[idx], '+', color='k')
plt.legend(loc='best')
plt.xlabel('FPR')
plt.ylabel('FNR')
plt.show()
The $FPR + FNR$ score is the lowest in this notebook so far: 0.382. The voting system that encompasses the three boosting models gives better performance than XGBoost, CatBoost and LightGBM taken individually on the binary classification task. Combining the 3 models with a majority voting system clearly improves performance on this data challenge.
Soft voting predicts the class label based on the argmax of the sums of the predicted probabilities.
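A minimal sketch of the soft-voting rule, with hypothetical class probabilities from the 3 base models for one sample:

```python
import numpy as np

# hypothetical class probabilities [P(class 0), P(class 1)] from the 3 base classifiers
probas = np.array([
    [0.4, 0.6],   # xgbc
    [0.7, 0.3],   # lgbc
    [0.3, 0.7],   # catgbc
])

# soft voting: argmax of the summed (equivalently, averaged) probabilities
label = int(np.argmax(probas.sum(axis=0)))
print(label)  # 1
```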
voting_gbc_X_train, voting_gbc_X_valid, voting_gbc_y_train, voting_gbc_y_valid = train_test_split(X_dataframe, y, test_size=0.2, random_state=69)
%%time
estimators = [
('xgbc', XGBClassifier(booster='gbtree', learning_rate=0.3,
max_depth=6, n_estimators=103,
colsample_bynode=1, colsample_bytree=1,
subsample=1, gamma=0,
objective='binary:logistic', random_state=69)),
('lgbc', LGBMClassifier(objective= 'binary',
n_estimators = 2000, random_state=69)),
('catgbc', CatBoostClassifier(eval_metric= 'Logloss', iterations= 1500,
learning_rate= 0.1, subsample= 0.8, random_state=69))
]
voting_clf = VotingClassifier(estimators=estimators, voting='soft')
voting_clf.fit(voting_gbc_X_train, voting_gbc_y_train)
voting_clf_y_pred = voting_clf.predict(voting_gbc_X_valid)
valid_score = criterion(voting_clf_y_pred, voting_gbc_y_valid)
print('FPR + FNR = {}'.format(valid_score))
# to further inspect the performance:
CM = confusion_matrix(voting_gbc_y_valid, voting_clf_y_pred)
TN, TP = CM[0, 0], CM[1, 1]
FP, FN = CM[0, 1], CM[1, 0]
print('Confusion Matrix: \n {}'.format(CM))
print('Accuracy: {}'.format((TP + TN) / (TP + TN + FP + FN)))
print('False Positive Rate: {}'.format(FP / (FP + TN)))
print('False Negative Rate: {}'.format(FN / (FN + TP)))
print('FPR + FNR = {}'.format(FP / (FP + TN) + FN / (FN + TP)))
plt.figure(figsize=(6,4))
plt.grid()
gb_y_prob = voting_clf.predict_proba(voting_gbc_X_valid)[:, 1]
fpr, tpr, thresholds = roc_curve(voting_gbc_y_valid, gb_y_prob, pos_label=1)
idx = np.argmin(fpr + (1-tpr))
plt.plot(fpr, 1-tpr, label='Voting (soft)')
plt.plot(fpr[idx], (1-tpr)[idx], '+', color='k')
plt.legend(loc='best')
plt.xlabel('FPR')
plt.ylabel('FNR')
plt.show()
The $FPR + FNR$ score here is 0.384. The voting system that encompasses the three boosting models again beats XGBoost, CatBoost and LightGBM individually, but the majority (hard) vote performs better than the soft vote, which predicts the class label from the argmax of the sums of the predicted probabilities. The best model therefore remains the voting system with a majority vote.
Without a doubt, neural networks nowadays outperform most traditional approaches, and for computer vision tasks they are among the best options. However, neural networks require an adequate architecture and can take a very long time to train. Moreover, training a neural network efficiently requires a large volume of data.
To be efficient, neural networks need an adequate architecture based on the stacking of different layers.
In order to find the best architecture, I made several attempts; the architecture I chose is a neural network with 3 dense layers of 48 neurons each (layer_size = 48), each followed by a ReLU activation. I also use batch normalisation for improved performance. To limit the risk of overfitting I apply a dropout of 0.5 after each layer. Overfitting is a frequent problem when training a deep learning model, and dropout is a standard technique to counter it: at each training step, some neurons (together with all their input and output connections) are temporarily deactivated at random, so on each forward pass the model learns with a different configuration of neurons.
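As an illustration only, here is a minimal numpy sketch of (inverted) dropout as applied at training time; the Keras `Dropout` layer handles this internally, so this is just to show the mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate=0.5, training=True):
    """Inverted dropout: zero a random fraction `rate` of the units and
    rescale the survivors so the expected activation is unchanged."""
    if not training:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

activations = np.ones(8)
print(dropout(activations))  # each entry is either 0.0 or 2.0
```

At inference time (`training=False`) nothing is dropped, which matches the behaviour of the Keras layer.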
Regarding the training strategy of the neural network, I chose to train it on the entire training dataset, without restricting it to the selected variables, because a neural network requires a large volume of data to be effective. The parameters chosen for the training are the following:
epochs = 20, batch_size = 1024
# split into X_train, y_train, X_valid and y_valid
nn_X_train, nn_X_valid, nn_y_train, nn_y_valid = train_test_split(X_dataframe, y_dataframe, test_size=0.2, random_state=12)
# convert data to numpy
nn_X_train = np.array(nn_X_train)
nn_y_train = np.ravel(np.array(nn_y_train))
nn_X_valid = np.array(nn_X_valid)
nn_y_valid = np.array(nn_y_valid)
# architecture of the network
nn_epochs = 20
nn_batch_size = 1024
nn_verbose = 1
nn_input_size = nn_X_train.shape[1]
nn_layer_size = 48 # neurons number
nn_validation_split = 0.1
nn_dropout = 0.5 # manage overfitting
# building of the model
model = tf.keras.models.Sequential()
# dense layers
model.add(keras.layers.Dense(nn_layer_size,input_shape=(nn_input_size,), name='dense_layer_1', use_bias=False))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Activation("relu"))
model.add(keras.layers.Dropout(nn_dropout))
model.add(keras.layers.Dense(nn_layer_size, name='dense_layer_2', use_bias=False))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Activation("relu"))
model.add(keras.layers.Dropout(nn_dropout))
model.add(keras.layers.Dense(nn_layer_size, name='dense_layer_3', use_bias=False))
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Activation("relu"))
model.add(keras.layers.Dropout(nn_dropout))
# decision layer
model.add(keras.layers.Dense(1, name='dense_layer_final', activation='sigmoid'))
# summary of the model
model.summary()
# compilation
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=["accuracy"])
%%time
# training of the model
history = model.fit(nn_X_train, nn_y_train, batch_size=nn_batch_size, epochs=nn_epochs, verbose=nn_verbose, \
validation_split=nn_validation_split);
# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# compute FPR + FNR score
# set a threshold to convert the sigmoid output probability to a binary value {True, False}
nn_y_pred = tf.greater(model.predict(nn_X_valid), 0.5)
# convert tensor of {True, False} binary value to int {1, 0} value
nn_y_pred = nn_y_pred.numpy().astype(int)
valid_score = criterion(nn_y_pred, nn_y_valid)
print('FPR + FNR = {}'.format(valid_score))
The $FPR + FNR$ score is not the lowest of this notebook: 0.44. Boosting algorithms such as XGBoost, LightGBM and CatBoost do better, with an $FPR + FNR$ score around 0.40, than the neural network. This poorer performance may be explained by the fact that the selected architecture is not necessarily the best one available.
The objective is to extend the number of features of the training dataset and apply my best model, the combination of the XGBoost, LightGBM and CatBoost models in a voting classifier, with the same hyperparameters.
To do this, I create combinations of the vectors $z_1$ and $z_2$ such as $z_1 + z_2$, $z_1 - z_2$ and the elementwise product $z_1 \odot z_2$.
X_dataframe training dataset
X_dataframe_enlarged = X_dataframe.copy()
for i in range(X_dataframe_enlarged.shape[1]):
col_A = X_dataframe_enlarged.iloc[:,i]
col_B = X_dataframe_enlarged.iloc[:,48+i]
X_dataframe_enlarged["col_"+str(i)+"minus_col_"+str(48+i)] = col_A - col_B
X_dataframe_enlarged["col_"+str(i)+"plus_col_"+str(48+i)] = col_A + col_B
X_dataframe_enlarged["col_"+str(i)+"dot_col_"+str(48+i)] = col_A * col_B
X_dataframe_enlarged
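As a side note, the column-by-column loop above inserts one derived column at a time; an equivalent vectorized sketch (the DataFrame and column names below are illustrative, not the ones used above) could build all the derived features in a single concatenation:

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for X_dataframe: 96 columns = two concatenated 48-d templates
X = pd.DataFrame(np.random.rand(5, 96))

z1 = X.iloc[:, :48].to_numpy()
z2 = X.iloc[:, 48:].to_numpy()

# build z1-z2, z1+z2 and z1*z2 in one shot, then concatenate once
extra = pd.DataFrame(
    np.hstack([z1 - z2, z1 + z2, z1 * z2]),
    columns=[f"{op}_{i}" for op in ("minus", "plus", "dot") for i in range(48)],
    index=X.index,
)
X_enlarged = pd.concat([X, extra], axis=1)
print(X_enlarged.shape)  # (5, 240)
```

Building the 144 new columns at once avoids the repeated DataFrame insertions of the loop, which pandas tends to warn about on wide frames.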
X_test_dataframe test dataset
# Load test data
X_test = np.load("test_data.npy")
X_test_enlarged = X_test.copy()
X_test_dataframe_enlarged = pd.DataFrame(X_test_enlarged)
for i in range(X_test_dataframe_enlarged.shape[1]):
col_A = X_test_dataframe_enlarged.iloc[:,i]
col_B = X_test_dataframe_enlarged.iloc[:,48+i]
X_test_dataframe_enlarged["col_"+str(i)+"minus_col_"+str(48+i)] = col_A - col_B
X_test_dataframe_enlarged["col_"+str(i)+"plus_col_"+str(48+i)] = col_A + col_B
X_test_dataframe_enlarged["col_"+str(i)+"dot_col_"+str(48+i)] = col_A * col_B
X_test_dataframe_enlarged
enlarged_voting_gbc_X_train, enlarged_voting_gbc_X_valid, enlarged_voting_gbc_y_train, enlarged_voting_gbc_y_valid = train_test_split(X_dataframe_enlarged, y, test_size=0.2, random_state=57)
%%time
estimators = [
('xgbc', XGBClassifier(booster='gbtree', learning_rate=0.3,
max_depth=6, n_estimators=103,
colsample_bynode=1, colsample_bytree=1,
subsample=1, gamma=0,
objective='binary:logistic', random_state=57)),
('lgbc', LGBMClassifier(objective= 'binary',
n_estimators = 2000, random_state=57)),
('catgbc', CatBoostClassifier(eval_metric= 'Logloss', iterations= 1500,
learning_rate= 0.1, subsample= 0.8, random_state=57))
]
voting_clf_enlarged = VotingClassifier(estimators=estimators, voting='hard')
voting_clf_enlarged.fit(enlarged_voting_gbc_X_train, enlarged_voting_gbc_y_train)
enlarged_voting_clf_y_pred = voting_clf_enlarged.predict(enlarged_voting_gbc_X_valid)
valid_score = criterion(enlarged_voting_clf_y_pred, enlarged_voting_gbc_y_valid)
print('FPR + FNR = {}'.format(valid_score))
The $FPR + FNR$ score is the lowest of this notebook: 0.379. The voting system that encompasses the three boosting models, trained on the enlarged dataset, gives better performance than XGBoost, CatBoost and LightGBM individually on the binary classification task. We therefore deduce that generating new features improves the performance of the model.
We prepare the submission of the best model by training the model on all the data at our disposal without splitting the dataset.
The best performing model is a voting model which includes 3 boosting algorithms:
- XGBoost
- LightGBM
- CatBoost
with a majority voting system.
%%time
best_estimators = [
('xgbc', XGBClassifier(booster='gbtree', learning_rate=0.3,
max_depth=6, n_estimators=103,
colsample_bynode=1, colsample_bytree=1,
subsample=1, gamma=0,
objective='binary:logistic', random_state=57)),
('lgbc', LGBMClassifier(objective= 'binary',
n_estimators = 2000, random_state=57)),
('catgbc', CatBoostClassifier(eval_metric= 'Logloss', iterations= 1500,
learning_rate= 0.1, subsample= 0.8, random_state=57))
]
best_model = VotingClassifier(estimators=best_estimators, voting='hard')
best_model.fit(X_dataframe_enlarged, y)
We prepare the submission of the model to the data challenge site: we save the model predictions in a text file that will be uploaded to the challenge website.
We make the prediction on the testing dataframe enlarged X_test_dataframe_enlarged
# Classify the provided test data
y_test = best_model.predict(X_test_dataframe_enlarged).astype(np.int8)
np.savetxt('enlarged_voting_y_test_challenge_student_V2.txt', y_test, fmt='%i' , delimiter=',')
| Model | Feature Engineering | Hard Voting | Soft Voting | FPR + FNR (valid) | 1 - (FPR + FNR) (valid) |
|---|---|---|---|---|---|
| Adaboost (baseline) | Features Selection | No | No | 0.52 | 0.48 |
| Gradient Boosting | Features Selection | No | No | 0.54 | 0.46 |
| XGBoost | Features Selection | No | No | 0.45 | 0.55 |
| XGBoost | Initial features | No | No | 0.40 | 0.60 |
| LightGBM | Initial features | No | No | 0.39 | 0.61 |
| CatBoost | Initial features | No | No | 0.387 | 0.613 |
| XGBoost + LightGBM + CatBoost | Initial features | Yes | No | 0.382 | 0.618 |
| XGBoost + LightGBM + CatBoost | Initial features | No | Yes | 0.384 | 0.616 |
| XGBoost + LightGBM + CatBoost | Enlarged features | Yes | No | 0.379 | 0.621 |
| Neural Network | Initial features | No | No | 0.44 | 0.56 |
For this data challenge I followed the following classic machine learning steps:
1/ Data investigation
2/ Data preprocessing
- drop duplicate elements
- convert type of columns
3/ Features selection
- select the best features to simplify the classification task
4/ Apply Machine learning algorithms
5/ Apply a neural network
6/ Generate new features
7/ Fit the best model
8/ Predict the label with the best model
We could see that feature selection did not have the expected effect: when I train a model on the whole dataset without the feature selection step, it gives better results, as we saw with the XGBoost model and the other boosting models.
Overall, the boosting models provide the best performance; the best individual boosting model for this data challenge is CatBoost. But I managed to improve performance further with a majority vote between XGBoost, LightGBM and CatBoost.
In addition, I managed to further improve the performance of my model by creating new features via combinations between existing features. Extending the number of explanatory variables of the input dataset improves the performance of the model.
Moreover, deep learning models are currently state-of-the-art for image classification tasks, so one could easily assume that a neural network would be the best performing model for this data challenge. However, this is not the case: the neural network I implemented does not outperform the boosting models.
Through this data challenge, I realized that refining the hyperparameters of machine learning models is not an easy thing. However, when hyperparameters are well chosen, they usually lead to better performing models. I also noticed that boosting models are better when they work together, combined with a majority voting system.